COMP 3000 Essay 1 2010 Question 6: Difference between revisions

From Soma-notes
Cha0s (talk | contribs)
No edit summary
Hesperus (talk | contribs)
Edited article, provided introduction, overview and Mars-Rover case.

Revision as of 06:11, 12 October 2010

Question

What are some examples of notable systems that have failed due to flawed efforts at mutual exclusion and/or race conditions? How significant was the failure in each case?

Answer

Introduction

Race condition bugs have their fair share of notoriety in the history of software bugs. Their effects range from a piece of Java code causing an application to halt to failures in life-critical systems with fatal results. In this article, we will define race conditions, examine some of the best-known cases involving them, and explore some of the schemes the industry has proposed to track and detect race conditions.

Overview

A race condition is a situation in which two or more processes can access the same piece of data simultaneously and the end result depends on the timing of those processes. The result can be quite hazardous, leading to major system failures.
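The dependence on timing can be made concrete with a small sketch. The example below is not taken from any of the systems discussed in this article; the function name and the account balance scenario are assumptions chosen for illustration. It replays one unlucky interleaving by hand, so the outcome is the same on every run.

```python
def lost_update():
    """Replay one bad interleaving of two read-modify-write sequences."""
    balance = 100

    a_read = balance        # process A reads 100
    b_read = balance        # process B reads 100 before A has written back

    balance = a_read + 50   # A deposits 50 and writes 150
    balance = b_read - 30   # B withdraws 30 using its stale read and writes 70

    return balance


def serialized_update():
    """The same two operations, run one after the other."""
    balance = 100
    balance = balance + 50  # A finishes completely first
    balance = balance - 30  # then B runs on the up-to-date value
    return balance
```

In the interleaved version A's deposit is silently lost, while the serialized version gives the result both processes intended.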

The need to control race conditions brings us to the concept of mutual exclusion (mutex). Mutual exclusion is the idea of making sure processes access shared data in a serialized way. That is, if process A is executing a section of code that uses a particular shared data structure (called a critical section), then no other process, such as B, is allowed to execute a critical section that uses that same data structure until process A leaves its critical section. Common techniques used to implement mutual exclusion include locks, semaphores and monitors.
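As a minimal sketch of the lock technique mentioned above, the following example uses Python's standard `threading.Lock` to guard a shared counter. The function name `run_counter` and its parameters are assumptions made for illustration, and threads stand in for the processes A and B described in the text.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may
            # read, modify and write the shared counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every increment happens while the lock is held, no update can be lost and the final value is always `num_threads * increments`.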

Therac-25

(This is still very rough and needs work. Thought I would lay it out there as a starting point)

The Therac-25 was a radiation therapy machine (a medical linear accelerator) developed in Canada by Atomic Energy of Canada Limited (AECL). Between 1985 and 1987, six patients were given overdoses of radiation by the machine, and half of these patients died as a result. The cause of the incidents has been traced back to a programming bug that caused a race condition. The Therac-25 software was written by a single programmer in PDP-11 assembly language, and portions of the code were reused from software in the earlier Therac-6 and Therac-20 machines. The main portion of the code runs a task called “Treat”, which determines which of the program's 8 main subroutines it should be executing. The keyboard handler task ran concurrently with “Treat”.

The 8 main subroutines were:

Reset

Datent

Set Up Done

Set Up Test

Patient Treatment

Pause Treatment

Terminate Treatment

Date, Time, ID Changes


The Datent subroutine communicated with the keyboard handler task through a shared variable that signaled whether the operator had finished entering the necessary data. Once the flag is set, signifying that the operator has entered the necessary information, Datent allows the main program to move on to the next subroutine. If the flag is not set, the “Treat” task reschedules itself, which in turn reschedules the Datent subroutine. This continues until the shared data entry flag is set.


The Datent subroutine was also responsible for preparing the machine to administer the correct radiation dosage. The subroutine was set up so that, before returning control to “Treat” to move on to the next of its 8 subroutines, it would first call the “Magnet” subroutine. This subroutine parsed the operator's input and moved the machine's magnets into position to administer the prescribed radiation. The “Magnet” subroutine took approximately 8 seconds to complete, and the keyboard handler kept running while it ran. If the operator modified the data before the “Magnet” subroutine returned, the changes would not be registered, and the beam strength would remain set to its prior value, ignoring the operator’s changes.


Hypothetical example situation:

-Operator types up data, presses return

-(Magnet subroutine is initiated)

-Operator realizes there is an extra 0 in the radiation intensity field

-Operator moves cursor up, fixes the error and presses return again

-Magnets are set to the previous power level and the subroutine returns

-Program moves on to next subroutine without registering changes

-Patient is administered a lethal overdose of radiation
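The sequence above can be sketched as a small simulation. This is not the Therac-25 code, which was written in PDP-11 assembly. The names `simulate`, `operator` and `datent`, the dose values, and the use of events in place of the 8 second magnet delay are all assumptions made for illustration. The re-check after setup is one possible mitigation and is not presented as the fix that was actually applied to the machine.

```python
import threading


def simulate(recheck_after_setup):
    """Return (dose applied, dose shown on screen) for one session."""
    shared = {"dose": 2000}            # operator typed an extra 0
    setup_started = threading.Event()
    edit_done = threading.Event()

    def operator():
        # The operator corrects the value while the magnets are moving.
        setup_started.wait()
        shared["dose"] = 200
        edit_done.set()

    def datent():
        while True:
            snapshot = shared["dose"]  # value used to position the magnets
            setup_started.set()
            edit_done.wait()           # stands in for the slow magnet setup
            # Flawed version: proceed with the snapshot unconditionally.
            # Hypothetical mitigation: redo setup if the data changed.
            if not recheck_after_setup or shared["dose"] == snapshot:
                return snapshot

    editor = threading.Thread(target=operator)
    editor.start()
    applied = datent()
    editor.join()
    return applied, shared["dose"]
```

Without the re-check, the screen shows the corrected value while the machine applies the stale one. With the re-check, the setup is repeated and both values agree.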

Black-out of 2003

On August 14th, 2003, a massive power outage spread through the Northeastern and Midwestern United States and Canada. A generating plant in Eastlake, Ohio went offline, causing a domino effect that ultimately led to over 100 power plants shutting down.

Several causes have been attributed to this massive failure. One of the most prominent was a software bug in General Electric Energy's Unix-based XA/21 energy management system.

FirstEnergy's Akron, Ohio control center was responsible for monitoring the Eastlake plant. However, the software flaw caused the control center to be unable to receive any warning or alarm from the plants.

Because of this, the control center's ability to prevent the cascading effect after the Eastlake plant went offline was impaired.

The XA/21 bug was triggered through a unique combination of events and alarm conditions on the equipment it was monitoring. The main system failed, unable to handle the combination of requests. By the time the back-up server kicked in, the accumulation of events since the main system failure caused it to go down as well.

The system made no indication that it had failed, and the control center received no warnings about the fact that they were operating without an alarm system.

The combination that caused the first system failure was three sagging power lines tripping simultaneously. The three separate events attempted to operate on shared state at the same time, so no alarm was raised and the system failed.
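One common way to keep simultaneous events from colliding on shared state is to pass them through a thread-safe queue so that a single consumer handles them one at a time. The sketch below is not based on the XA/21 code, whose internals are not described here. The function name, the line identifiers and the alarm text are assumptions made for illustration.

```python
import queue
import threading


def process_line_trips(line_ids):
    """Report several simultaneous line trips through one serialized queue."""
    events = queue.Queue()  # queue.Queue does its own internal locking

    def report_trip(line_id):
        events.put(line_id)

    reporters = [
        threading.Thread(target=report_trip, args=(line_id,))
        for line_id in line_ids
    ]
    for t in reporters:
        t.start()
    for t in reporters:
        t.join()

    # All producers have finished, so a single consumer can drain
    # the queue and raise one alarm per event.
    alarms = []
    while not events.empty():
        alarms.append(f"ALARM: line {events.get()} tripped")
    return sorted(alarms)
```

Each trip produces exactly one alarm regardless of the order in which the reporting threads happen to run.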

The NASA Mars-Rover

The NASA Mars-Rover incident is another well-known case of system failure due to race conditions. The Mars-Rover is a six-wheel-drive, four-wheel-steered vehicle designed by NASA to navigate the surface of Mars in order to gather videos, images, samples and other data about the planet.

Hardware design and architecture

The vehicle's main operating equipment consists of a set of wide-angle and narrow-angle cameras and a collection of specialized spectrometers. This equipment, which also includes the motors and the power bus, is wired to an electronics card cage called the rover equipment module (REM). The main computer was built around a RAD-6000 CPU (Rad6k), RAM and non-volatile memory (a combination of FLASH and EEPROM).

Software design

The autonomous operation of the flight software maintains the vehicle in the state needed to receive and act upon commands, execute sequences of commands when available, and collect and format data for transmission.

Other software modules handle engineering functions such as powering components on and off, conducting communications, managing memory and resources, monitoring device health status and performing sequence control. The more operational tasks include acquiring images and videos, processing data, powering instruments and carrying out the orders needed to drive the vehicle.

The main software records progress through the use of three primary log-file systems: event reports (EVRs), engineering data (EH&A) and data products.

System anomalies and errors

Windows Blue-Screens-Of-Death

Conclusions

References