COMP 3000 Essay 1 2010 Question 6
Question
What are some examples of notable systems that have failed due to flawed efforts at mutual exclusion and/or race conditions? How significant was the failure in each case?
Answer
Check the discussion tab. -- Munther
Introduction Paragraph: Introduces the question and gives some general background etc.
Paragraph 1: Gives first example in detail
Paragraph 2: Gives second example in detail
Paragraph 3: Gives third example in detail
Paragraph 4: Gives fourth example in detail
Paragraph 5: Gives fifth example in detail
Conclusion: Relates it all back together or something (never been good with conclusions)
Therac-25
(This is still very rough and needs work. Thought I would lay it out there as a starting point)
The Therac-25 was a radiation therapy machine developed in Canada by Atomic Energy of Canada Limited (AECL). Between 1985 and 1987, six patients were given overdoses of radiation by the machine, and half of these patients died as a result. The cause of the incidents has been traced back to a programming bug which caused a race condition. The Therac-25 software was written by a single programmer in PDP-11 assembly language, and portions of the code were reused from the software of the earlier Therac-6 and Therac-20 machines. The main portion of the code runs a task called “Treat”, which determines which of the program's 8 main subroutines should be executing. The keyboard handler task ran concurrently with “Treat”.
The 8 main subroutines were:
Reset
Datent
Set Up Done
Set Up Test
Patient Treatment
Pause Treatment
Terminate Treatment
Date, Time, ID Changes
The Datent (data entry) subroutine communicated with the keyboard handler task through a shared variable that signaled whether the operator had finished entering the necessary data. Once Datent saw the flag set, it allowed the main program to move on to the next subroutine. If the flag was not set, the “Treat” task rescheduled itself, which in turn rescheduled the Datent subroutine. This continued until the shared data entry flag was set.
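The flag handshake described above can be sketched roughly as follows. This is a simplified single-threaded model in Python, not the original PDP-11 assembly; every name in it (SharedState, data_entry_complete, keystrokes_per_tick) is made up for illustration, and each loop iteration stands in for one rescheduling of “Treat”.

```python
# Simplified model of the Datent / keyboard handler handshake.
# All names are hypothetical; the real software was PDP-11 assembly.

class SharedState:
    def __init__(self):
        # Flag shared by the keyboard handler and Datent.
        self.data_entry_complete = False


def keyboard_handler(state, keys_pressed):
    """Set the shared flag once the operator presses return."""
    if "RETURN" in keys_pressed:
        state.data_entry_complete = True


def datent(state):
    """Report whether Treat may advance to the next subroutine."""
    return state.data_entry_complete


def treat(state, keystrokes_per_tick):
    """Reschedule Datent once per tick until the shared flag is set.

    Returns the tick at which data entry completed, or None if the
    operator never pressed return.
    """
    for tick, keys in enumerate(keystrokes_per_tick):
        keyboard_handler(state, keys)  # runs alongside Treat
        if datent(state):
            return tick  # Treat would now move on to the next subroutine
    return None
```

For example, treat(SharedState(), [["1"], ["0"], ["RETURN"]]) finishes on the third tick. Nothing in this handshake protects the entered data itself, only the "finished" flag, which is where the trouble starts.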
The Datent subroutine was also responsible for preparing the machine to administer the correct radiation dosage. Before returning control to “Treat” so that it could move on to the next of its 8 subroutines, Datent first called the “Magnet” subroutine. This subroutine took the operator's input and moved the machine's magnets into position to administer the prescribed radiation. The “Magnet” subroutine took approximately 8 seconds to complete, and the keyboard handler kept running while it did. If the operator modified the data before “Magnet” returned, the changes were not registered, and the machine stayed set up according to the earlier values, ignoring the operator's changes.
Hypothetical example situation:
-Operator types in the data and presses return
-(Magnet subroutine is initiated)
-Operator realizes there is an extra 0 in the radiation intensity field
-Operator moves the cursor up, fixes the error and presses return again
-Magnets are set to the previous power level and the subroutine returns
-Program moves on to the next subroutine without registering the changes
-Patient is administered a lethal overdose of radiation
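The situation above can be replayed as a small deterministic simulation. This is only a sketch under assumptions: the tick loop stands in for the roughly 8 second magnet delay, the dose numbers are invented, and the recheck option is one illustrative way to close the window, not a description of the fix AECL actually shipped.

```python
MAGNET_TICKS = 8  # stands in for the roughly 8-second magnet delay


def run_datent(entered_dose, edits, recheck):
    """Simulate one data-entry phase (hypothetical model).

    entered_dose: value on screen when the operator first presses return.
    edits: dict mapping tick -> corrected dose typed while the magnets move.
    recheck: if True, compare screen and hardware after the magnets settle
             and redo the setup when they differ.
    Returns (dose the hardware is set to, dose shown on screen).
    """
    screen = entered_dose
    while True:
        hardware = screen  # snapshot taken before the magnets start moving
        for tick in range(MAGNET_TICKS):
            if tick in edits:  # keyboard handler runs concurrently
                screen = edits.pop(tick)
        if not recheck or hardware == screen:
            return hardware, screen


# Operator enters 1000 by mistake, then corrects it to 100 at tick 3.
buggy = run_datent(1000, {3: 100}, recheck=False)
fixed = run_datent(1000, {3: 100}, recheck=True)
```

Without the recheck, the hardware keeps the stale value 1000 while the screen shows 100, which is the mismatch in the list above. With the recheck, the setup is repeated and both end up at 100.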
Black-out of 2003
On August 14th, 2003, a massive power outage spread through the Northeastern and Midwestern United States and Canada. A generating plant in Eastlake, Ohio went offline, causing a domino effect that ultimately led to over 100 power plants shutting down.
Several causes have been attributed to this massive failure. One of the most prominent was a software bug in General Electric Energy's Unix-based XA/21 energy management system.
FirstEnergy's Akron, Ohio control center was responsible for monitoring the Eastlake plant. However, the software flaw prevented the control center from receiving any warnings or alarms from the plants.
Because of this, the control center's ability to prevent the cascading effect after the Eastlake plant went offline was compromised.
The XA/21 bug was triggered through a unique combination of events and alarm conditions on the equipment it was monitoring. The main system failed, unable to handle the combination of requests. By the time the back-up server kicked in, the accumulation of events since the main system failure caused it to go down as well.
The system gave no indication that it had failed, so the control center received no warning that it was operating without an alarm system.
The combination of events that caused the first system failure was three sagging power lines tripping at the same time. The three separate events tried to operate on shared state simultaneously, with the result that no alarm was raised and the system failed.
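The details of the XA/21 fault are not given here, so the following is only a generic Python sketch of how several simultaneous events can corrupt shared state, and how mutual exclusion avoids it. The names (pending_alarms, the line labels) are invented, and the first function forces the bad interleaving by hand so the result is repeatable.

```python
import threading


def unsafe_interleaving(events):
    """Every handler reads the shared count before any handler writes back."""
    shared = {"pending_alarms": 0}
    # Each handler takes its own stale snapshot of the shared value.
    snapshots = [shared["pending_alarms"] for _ in events]
    # Each handler then writes back its snapshot plus one.
    for snap in snapshots:
        shared["pending_alarms"] = snap + 1
    return shared["pending_alarms"]


def locked_handlers(events):
    """Each handler updates the shared count while holding a lock."""
    shared = {"pending_alarms": 0}
    lock = threading.Lock()

    def handle(_event):
        with lock:  # read-modify-write is now mutually exclusive
            shared["pending_alarms"] += 1

    threads = [threading.Thread(target=handle, args=(e,)) for e in events]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["pending_alarms"]


lines = ["line A", "line B", "line C"]  # hypothetical tripped lines
lost = unsafe_interleaving(lines)
kept = locked_handlers(lines)
```

With three events, the unsafe interleaving ends with a count of 1 because two updates are lost, while the locked version ends with 3. The real failure was more severe than a wrong count (the alarm system stopped working altogether), but the underlying pattern of unsynchronized access to shared state is the same.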