COMP 3000 Essay 1 2010 Question 6

From Soma-notes

=Question=
What are some examples of notable systems that have failed due to flawed efforts at mutual exclusion and/or race conditions? How significant was the failure in each case?

=Answer=
=Overview=
A race condition occurs when two or more processes gain write access to shared data simultaneously. The outcome depends on the exact timing of those processes and is therefore unpredictable; in the worst case, a major system failure can occur.


=Introduction=
Race conditions are notorious in the history of software bugs. Examples range from a section of Java code causing an application to halt, to the corruption of web services, to the failure of a life-critical system with fatal consequences. The failures described in this essay share common patterns and were caused by inadequate management of shared memory.


During the development of a system, programmers may have no way of knowing a race condition exists until it occurs. Race conditions are unexpected, infrequent, and their specific failure conditions are difficult to duplicate, so it may take weeks or even years for the origin of a failure to be discovered. This is especially true for complex systems. The examples below illustrate different situations in which race conditions have occurred. The Therac-25 race condition shows that reusing code from an older system can have fatal results. The 2003 blackout race condition highlights the need for testing when a system is deployed in a new environment. The Mars Rover race condition shows the importance of testing every possible combination of events when it is essential that the system never fail. Finally, the BSoD section shows that even a system deployed on millions of computers can still contain deeply embedded bugs that have not yet been fixed.

In this essay, we will examine some of the most well-known cases involving race conditions. For each case we will explain why the race condition occurred, its significance, and the aftermath of the failure.


=Examples=
== Therac-25 ==


The Therac-25 was an x-ray machine developed in Canada by Atomic Energy of Canada Limited (AECL). The machine was used to treat people using radiation therapy. Between 1985 and 1987, six patients were given overdoses of radiation by the machine; half of them died as a result. The incident is quite possibly the most infamous software failure related to race conditions, and its cause was traced back to a programming bug that created a race condition.
The Therac-25 software was written by a single programmer in PDP-11 assembly language. Portions of code were reused from software in the previous Therac-6 and Therac-20 machines.
The main portion of the code runs a function called “Treat”. This function determines which of the program's 8 main subroutines should be executing. The keyboard handler task runs concurrently with “Treat”.


===Main Subroutines===


The following are the 8 main subroutines the Therac-25 made use of. Datent (Data Entry) had its own helper routine called Magnet, which prepared the x-ray's magnets to administer the correct dosage of radiation.


#Reset
#Datent
#*Magnet
#Set Up Done
#Set Up Test




The Datent subroutine communicated with the keyboard handler task through a shared variable which signaled whether the operator was finished entering the necessary data. Once the Datent subroutine sets the flag signifying the operator has entered the necessary information, it allows the main program to move on to the next subroutine. If the flag was not set, the “Treat” task reschedules itself, in turn rescheduling the Datent subroutine. This continues until the shared data entry flag is set.




The Datent subroutine was also responsible for preparing the x-ray to administer the correct radiation dosage. Before returning control to “Treat” so it could move on to the next of its 8 subroutines, Datent would first call the “Magnet” subroutine. This subroutine parsed the operator's input and moved the x-ray machine's magnets into position to administer the prescribed radiation. The Magnet subroutine took approximately 8 seconds to complete, and while it ran, the keyboard handler was also running. If the operator modified the data before “Magnet” returned, their changes would not be registered: the x-ray strength would already be set to its prior value, ignoring the operator's changes.




The situation below illustrates a chain of events that would result in an unintended dose of radiation being administered.


#Operator types up data, presses return.
#Magnet subroutine is initiated.
#Operator realizes there is an extra 0 in the radiation intensity field.
#Operator quickly moves the cursor up, fixes the error, and presses return again.
#Magnets are set to the previous power level; the subroutine returns.
#Program moves on to the next subroutine without registering the changes.
#Patient is administered a lethal overdose of radiation.
 
 
===Root Causes & Outcomes===
 
A number of factors contributed to the failure of the Therac-25. The code was written by a single programmer and no proper testing was conducted. In addition, code was reused from the previous generation of Therac machines without first verifying that it was fully compatible with the new hardware. The earlier Therac-6 and Therac-20 had hardware interlocks which prevented such race conditions from causing harm. It is clear that proper planning and forethought could have prevented these incidents.
 
Six incidents involving the Therac-25 took place between 1985 and 1987, and it took two years before the FDA took the machines out of service. The FDA forced AECL to make modifications to the Therac-25 before it was allowed back on the market. The software was fixed to suspend all other operations while the magnets positioned themselves to administer the correct radiation strength. In addition, a dead man's switch was added: a foot pedal that the operator must hold down to enable motion of the x-ray machine. This prevented the operator from being unaware of changes in the x-ray machine's state.
 
After these changes were made, the Therac-25 was reintroduced into the market in 1988. Some of the machines are still in service today.




== Black-out of 2003 ==


On August 14th, 2003, a massive blackout affected Ontario and the northeastern United States. It began with a power plant in Eastlake, Ohio going off-line during a time of high electric demand, meaning that the power would have to come from elsewhere. Three power lines began to sag under the greater strain, causing them to come into contact with overgrown trees. When a power line comes into contact with a tree, as these three did, it is automatically shut off. With these three lines off, even more strain was put on the remaining power lines, and an overloaded power line also shuts off. The result was a cascading effect ultimately leading to 256 power plants going off-line [1].


FirstEnergy's control center in Akron, Ohio was responsible for balancing the load on these power lines. However, the operators were not receiving any warnings because their energy management system, General Electric's Unix-based XA/21, had silently crashed. The operators could not receive warnings, and had no idea that they weren't receiving them. The software flaw that crashed the system left them unaware of power imbalances, and therefore unable to prevent the blackout [1].
   
   
===Cause of Race Condition===


After it was revealed that the XA/21 system crash was responsible for the control center not receiving alerts, an investigation was launched to determine the cause. After eight weeks of testing, GE was able to recreate the unique combination of events and alarm conditions that triggered the bug. A race condition was discovered: two processes came into contention over the same data structure, and through a flaw in one process's code, both were able to get write access to it. This corrupted the data necessary to trigger an alarm, sending the alarm event into an infinite loop [1].
 
Because the alarm event had crashed, events could not be processed as they came into the control center. The build-up of unprocessed events brought the energy management system's server down within thirty minutes of the alarm crash. A backup server kicked in to attempt to manage the load, but by then there were too many events queued up to handle, and the backup server went down as well [1].


===Aftermath===
 
The cascading effect started by the three downed power lines eventually reached the New York City power grid, leaving an estimated 55 million people without power for two days. New York in particular felt the immediate effects of the blackout, with 3000 fires being reported and emergency services responding to twice the average number of distress calls. Eleven fatalities have been attributed to the blackout, and its total cost has been estimated at 6 billion dollars [2].
 
The blackout revealed a major yet subtle bug, as well as exposing the shortcomings of the power grid at that time. GE has since released a patch that fixes the bug, along with instructions for properly installing the system [1]. The US-Canada Power System Outage Task Force released a report that included 46 recommendations to prevent future blackouts [3]. Congress, upon receiving this report, passed the Energy Policy Act of 2005 [2]. Among the standards put forth are requirements that operators at control centers be trained to deal with critical events, that trees be kept clear of transmission lines, and that any system involved in grid operations be able to handle a power line fault as well as any other failure that could endanger the grid [4].
 
All in all, there is no conclusive evidence that these changes have helped prevent blackouts, as their numbers have remained fairly stable [5]. However, with these new standards in place, it is unlikely that the events of the 2003 blackout will recur in the United States.


== The NASA Mars-Rover ==
The NASA Mars Rover incident is another well-known case of system failure due to race conditions. The Mars Rover was a six-wheel driven, four-wheel steered vehicle designed by NASA to navigate the surface of Mars in order to gather videos, images, samples and other data about the planet. NASA landed two Rover vehicles, Spirit and Opportunity, on January 4th and January 25th, 2004, respectively. The Rovers were controlled on a daily basis by the NASA team on Earth, which sent them messages and tasks.


===Hardware design and architecture===
The vehicle's main operating equipment consisted of a set of high-resolution cameras, a collection of specialized spectrometers and a set of radio antennas for transmitting and receiving data. The main computer was built around a BAE RAD-6000 CPU (Rad6k), RAM and non-volatile memory (a combination of FLASH and ROM).


===Software design===
The Rovers were controlled by the VxWorks real-time operating system. The Rover flight software was mostly implemented in ANSI C, with some fragments of code written in C++ and assembly.
The Rovers relied on an autonomous system that enabled them to drive themselves and carry out a number of self-maintenance operations. The system implemented time-multiplexing, with all processes sharing and accessing resources on the single CPU. The Rovers recorded progress through the use of three primary log-file systems: event reports (EVRs), engineering data (EH&A) and data products.


===System failures and vulnerabilities===
The first race-condition bug occurred on the Spirit Rover on Sol 131 (each solar day in the life of a Rover is called a Sol). The initialization module (IM) process was preparing to increment a counter that keeps track of the number of times an initialization has occurred. To do so, the IM process had to request permission and be granted access to write that counter to memory (a critical section). While it was requesting permission, another process was granted access to that very same piece of memory. This caused the IM process to generate a fatal exception through its EVR log. The exception led to lost and garbled data transmissions to the NASA team on Earth, and eventually left the Rover in a halted state for a few days. To keep the Rover functioning, the NASA team worked around the problem by restricting another module from operating during that time-frame, allowing enough time for the IM process to carry out its task. However, the NASA team was aware that the bug could resurface, and it did: later on Spirit on Sol 209, and then on the Opportunity Rover on Sol 596 and Sol 622.
 
A similar type of error occurred on Spirit on Sol 136, this time involving the Imaging Services Module (IMG). Just as the NASA team requested data from the Rover, the IMG entered a deactivation state. The IMG's read cycles from memory were suddenly interrupted by the deactivation process, which was attempting to power off the piece of memory associated with the IMG reading task. This resulted in a failure to return the requested data from the Rover.


===Aftermath and current status===
While these race condition errors were clearly due to a lack of memory management and proper co-ordination among processes, they were largely unexpected and unforeseen. In contrast to the other cases mentioned so far, the consequences NASA had to deal with were not life-threatening, so its main concern was to keep the Rovers functioning in order to obtain as much information as possible; no effort was made to alter the software. The task of examining and debugging these errors was also quite a challenge, since the team could not deal with the Rovers physically: everything was done via transmitted messages. It is also worth noting that the single CPU in each Rover had a great deal to handle beyond the usual software load; had NASA considered a multiple-CPU design, things could have been different.


The Spirit Rover has experienced a number of problems since then. The most recent reports reveal that it has been largely inactive, with no data being received from it. The Opportunity Rover, on the other hand, continues to function successfully.


==Windows Blue-Screens-Of-Death==
When a problem in Windows forces the operating system to fail, the computer will often display an error screen officially known as a "Stop Error". The message describes what went wrong, in white text on a blue background. Because of this color scheme, the screen has received the title "Blue Screen of Death", or BSoD for short. While BSoDs cover a much wider range of errors than just race conditions, they are among the most prominent.


The error 0x0000001a, MEMORY_MANAGEMENT, occurs due to a race condition involving memory management on the system.


BSoDs have been experienced by almost anyone who has used an OS like XP or Vista. More critical systems, such as airport displays and ATM machines, have also experienced these errors. A particularly public demonstration of the BSoD took place at the 2008 Beijing Summer Olympics in China, where one of the projectors crashed during the opening ceremony.


=Conclusion=


Obviously, there is a wide range of problems that can occur due to race conditions. Any system could harbour a race condition bug, and it is impossible to simulate how software will react to every possible hardware setup it might be coupled with. The examples above show that life-critical and essential systems should be held to a higher standard of testing. The Therac-25 incident makes clear that legacy code should be thoroughly tested when implemented in a new system. The 2003 blackout shows that when installing a system in a new environment, users should put it through adequate testing themselves before relying on it in real-world applications. The Mars Rover incident shows that even without old code or a new environment, race conditions can go unnoticed, highlighting the need to test every possible combination of events when failure could cost millions of dollars. Finally, on the more common side of things, the BSoD shows that even systems used on millions of computers can still contain deeply embedded bugs. While race conditions cannot always be avoided, a higher standard of testing would greatly decrease the probability of their occurring.


=References=
* Nancy Leveson. July 1993. [http://sunnyday.mit.edu/papers/therac.pdf Medical Devices: The Therac-25]
* Nancy Leveson and Clark Turner. July 1993. [http://www.stanford.edu/class/cs240/readings/therac-25.pdf An Investigation of the Therac-25 Accidents]
* Anne Marie Porrello. July 1993. [http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/THERAC25.html Death and Denial: The Failure of the THERAC-25, A Medical Linear Accelerator]
* Kevin Poulsen. April 2004. [http://www.securityfocus.com/news/8412 Tracking the Blackout Bug]
* JR Minkel. August 2008. [http://www.scientificamerican.com/article.cfm?id=2003-blackout-five-years-later The 2003 Northeast Blackout - 5 Years Later]
* US-Canada Power System Outage Task Force. [https://reports.energy.gov/ Final Report on the August 14th Blackout in the United States and Canada]
* North American Electric Reliability Corporation. [http://www.nerc.com/page.php?cid=2|20 Reliability Standards]
* Paul Hines, Jay Apt, and Sarosh Talukdar. January 2008. [http://wpweb2.tepper.cmu.edu/ceic/papers/ceic-08-01.asp Trends in the History of Large Blackouts in the United States]
* Reeves and Snyder. 10 January 2006. [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1571113&userType=inst An Overview of the Mars Exploration Rovers' Flight Software] ([http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/37499/1/05-0539.pdf alternate source])
* Matijevic and E. Dewell. 2006. [http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/39897/1/06-0922.pdf Anomaly Recovery and the Mars Exploration Rovers]
* NASA. [http://marsrover.nasa.gov/mission/status.html Update: Spirit and Opportunity]
* John Ross. 2006. It's Never Done That Before: A Guide to Troubleshooting Windows XP. No Starch Press.
* John Chan. 12 August 2008. [http://news.cnet.com/8301-17938_105-10015872-1.html Dreaded Blue Screen of Death Strikes Olympics]
* Dr. Dobb's Journal. 9 June 2010. [http://www.drdobbs.com/tools/225600068 Patent Awarded for Debugging Race Conditions]

Latest revision as of 11:33, 8 November 2010

Question

What are some examples of notable systems that have failed due to flawed efforts at mutual exclusion and/or race conditions? How significant was the failure in each case?

Answer

Overview

A race condition occurs when two or more processes receive write access to shared data simultaneously. The end result leads to unpredictable results depending on the exact timing of those processes. Consequently a major system failure can occur.

Introduction

Race conditions are notorious in the history of software bugs. Examples range from a section of Java code causing an application to halt, the corruption of web services, to the failure of a life-critical system with fatal consequences. All of the system failures due to race conditions have common patterns and are caused by inadequate management of shared memory.

During development of a system, the programmers may have no way of knowing a race condition exists until it occurs. Race conditions are unexpected, infrequent, and their specific failure conditions are difficult to duplicate, so it may take weeks or even years for the origin of a failure to be discovered, especially in complex systems. The following examples discuss different situations in which race conditions have occurred. The Therac-25 race condition shows that reusing code from an older system can have fatal results. The 2003 blackout race condition highlights the need for testing when a system is deployed in a new environment. The Mars Rover race condition shows the importance of testing every possible combination of events when it is essential that the system never fail. Finally, the BSoD section shows that even a major system deployed on millions of computers can still contain deeply embedded bugs not yet fixed.

In this essay, we examine some of the best-known cases involving race conditions. For each case we explain why the race condition occurred, its significance, and the aftermath of the failure.


Examples

Therac-25

The Therac-25 was a radiation therapy machine developed in Canada by Atomic Energy of Canada Limited (AECL). Between 1985 and 1987, six patients were given overdoses of radiation by the machine, and three of them died as a result. The incident is quite possibly the most infamous software failure related to race conditions; its cause was traced back to a programming bug that created a race condition. The Therac-25 software was written by a single programmer in PDP-11 assembly language, and portions of the code were reused from the earlier Therac-6 and Therac-20 machines. The main portion of the code runs a function called “Treat”, which determines which of the program's 8 main subroutines should be executing. The keyboard handler task runs concurrently with “Treat”.

Main Subroutines

The following are the 8 main subroutines the Therac-25 made use of. Datent (Data Entry) had its own helper routine, Magnet, which prepared the machine's magnets to administer the correct dosage of radiation.

  1. Reset
  2. Datent
    • Magnet
  3. Set Up Done
  4. Set Up Test
  5. Patient Treatment
  6. Pause Treatment
  7. Terminate Treatment
  8. Date, Time, ID Changes


The Datent subroutine communicated with the keyboard handler task through a shared variable that flagged whether the operator had finished entering the necessary data. Once this flag was set, Datent allowed the main program to move on to the next subroutine. If the flag was not set, the “Treat” task rescheduled itself, in turn rescheduling the Datent subroutine; this continued until the shared data-entry flag was set.


The Datent subroutine was also responsible for preparing the machine to administer the correct radiation dosage. Before returning to “Treat” to move on to the next of the 8 subroutines, Datent first called the “Magnet” subroutine, which parsed the operator's input and moved the machine's magnets into position to deliver the prescribed dose. Magnet took approximately 8 seconds to complete, and while it ran, the keyboard handler was also running. If the operator modified the data before Magnet returned, the changes were not registered: the magnets remained set to the previously entered value, silently ignoring the operator's corrections.


Example Bug Situation

The situation below illustrates a chain of events that would result in an unintended dose of radiation being administered.

  1. Operator types up data, presses return.
  2. Magnet subroutine is initiated.
  3. Operator realizes there is an extra 0 in the radiation intensity field.
  4. Operator quickly moves cursor up and fixes the error and presses return again.
  5. Magnets are set to previous power level. Subroutine returns.
  6. Program moves on to next subroutine without registering changes.
  7. Patient is administered a lethal overdose of radiation.


Root Causes & Outcomes

A number of factors contributed to the failure of the Therac-25. The code was put together by a single programmer and no proper testing was conducted. In addition, code was reused from the previous generation of Therac machines without first verifying that it was fully compatible with the new hardware. The earlier Therac-6 and Therac-20 had hardware interlocks that prevented software race conditions from causing harm; the Therac-25 relied on software alone. It is clear that proper planning and forethought could have prevented this incident.

Six incidents involving the Therac-25 took place between 1985 and 1987, and it took two years before the FDA took the machines out of service. The FDA forced AECL to make modifications to the Therac-25 before it was allowed back on the market. The software was fixed to suspend all other operations while the magnets positioned themselves to administer the correct radiation strength. In addition, a dead man's switch was added: a foot pedal the operator must hold down to enable motion of the machine, which prevented the operator from being unaware of changes in the machine's state.

After these changes were made, the Therac-25 was reintroduced into the market in 1988. Some of the machines are still in service today.


Black-out of 2003

On August 14th, 2003, a massive blackout affected Ontario and the northeastern United States. It all began with a power plant in Eastlake, Ohio, going off-line. This occurred during a time of high electrical demand, meaning the power would have to come from elsewhere. Three power lines began to sag under the greater strain, bringing them into contact with overgrown trees. When a power line comes into contact with a tree, as these three did, it is automatically shut off. With these three lines shut off, still more strain was put on other power lines; an overloaded line also shuts off. The result was a cascading effect that ultimately took 256 power plants off-line [1].

FirstEnergy's control center in Akron, Ohio was responsible for balancing the load of these power lines. However, the operators were not receiving any warnings because their energy management system had silently crashed, and they had no idea the warnings were missing. The energy management system in question was the Unix-based XA/21, created by General Electric. A software flaw in this system caused the failure, leaving the control center operators unaware of the power imbalances and therefore unable to prevent the blackout [1].

Cause of Race Condition

After it was revealed that the XA/21 system crash was responsible for the control center not receiving alerts, an investigation was launched to determine the cause. After 8 weeks of testing, GE was able to recreate the unique combination of events and alarm conditions that triggered the bug. A race condition was discovered: two processes came into contention over the same data structure, and through a flaw in one process's code, both were able to obtain write access to it. This corrupted the data needed to trigger an alarm, sending the alarm event into an infinite loop [1].

Because the alarm event had crashed, incoming events could no longer be processed as they arrived at the control center. The build-up of events brought the energy management system's server down within thirty minutes of the alarm event crash. A backup server kicked in to manage the load, but by then too many events had queued up to handle, and the backup server went down as well [1].

Aftermath

The cascading effect started by the three downed power lines eventually reached the New York City power grid, leaving an estimated 55 million people without power for two days. New York in particular felt the immediate effects of the blackout, with 3000 fires reported and emergency services responding to twice the average number of distress calls. Eleven fatalities have been attributed to the blackout, and its total cost has been estimated at 6 billion dollars [2].

The blackout's major impact was that it revealed a very subtle yet serious bug, while also exposing the shortcomings of the power grid at that time. GE has since released a patch that fixes the bug, along with instructions for properly installing the system [1]. The US-Canada Power System Outage Task Force released a report that included 46 recommendations to prevent future blackouts [3], and Congress, upon receiving this report, passed the Energy Policy Act of 2005 [2]. Among the standards that have been put forth are the need for operators at control centers to be trained to deal with critical events, a requirement for trees to be kept clear of transmission lines, and a requirement that any system involved in grid operations be able to handle a power line fault as well as any other failure that could endanger the grid [4].

All in all, there has been no conclusive evidence that these changes have helped prevent blackouts, as the numbers have been fairly stable [5]. However, with these new standards, it is unlikely the events of the 2003 blackout will reoccur in the United States.

The NASA Mars-Rover

The NASA Mars Rover incident is another well-known case of system failure due to race conditions. The Rovers were six-wheel-driven, four-wheel-steered vehicles designed by NASA to navigate the surface of Mars and gather video, images, samples, and other data about the planet. NASA landed two Rovers, Spirit and Opportunity, on January 4th and January 25th, 2004, respectively. The Rovers were controlled on a daily basis by the NASA team on Earth, which sent them messages and tasks.

Hardware design and architecture

The vehicle's main operating equipment consisted of a set of high-resolution cameras, a collection of specialized spectrometers and a set of radio antennas for transmitting and receiving data. The main computer was built around a BAE RAD-6000 CPU (Rad6k), RAM and non-volatile memory (a combination of FLASH and ROM).

Software design

The Rovers ran the VxWorks real-time operating system. The Rover flight software was mostly implemented in ANSI C, with some fragments written in C++ and assembly. The Rovers relied on an autonomous system that enabled them to drive themselves and carry out a number of self-maintenance operations. The system used time multiplexing, with all processes sharing and accessing resources on a single CPU. The Rovers recorded progress through three primary log-file systems: event reports (EVRs), engineering data (EH&A), and data products.

System failures and vulnerabilities

The first race-condition bug occurred on the Spirit Rover on Sol 131 (each solar day in the life of a Rover is called a Sol). The initialization module (IM) process was preparing to increment a counter that tracked the number of initializations. To do so, the IM process had to request permission and be granted access to write the counter to memory (a critical section). While the request was pending, another process was granted access to that very same piece of memory. This caused the IM process to generate a fatal exception through its EVR log. The exception led to lost data and trouble transmitting to the NASA team on Earth, and eventually left the Rover in a halted state for several days. To keep the Rover functioning, the NASA team worked around the problem by restricting another module from operating during that time frame, allowing enough time for the IM process to carry out its task. However, the team was aware that the bug could resurface, and it did: on Spirit on Sol 209, and then on Opportunity on Sols 596 and 622.

A similar type of error occurred on Spirit on Sol 136, this time involving the Imaging Services module (IMG). Just as the NASA team requested data from the Rover, the IMG entered a deactivation state. The IMG's read cycles from memory were suddenly interrupted by the deactivation process, which was attempting to power off the piece of memory associated with the IMG's read task. The result was a failure to return the requested data.

Aftermath and current status

While these race condition errors were clearly due to a lack of memory management and proper coordination among processes, they were largely unexpected and unforeseen. In contrast to the other cases mentioned so far, the consequences NASA had to deal with were not life-threatening, so the team's main concern was keeping the Rovers functioning in order to obtain as much information as possible; no attempt was made to alter the software. One can also imagine that examining and debugging these errors was quite a challenge, since the team could not handle the Rovers physically: everything was done via transmitted messages. It is also worth noting that the single CPU in each Rover had a great deal to manage beyond the usual software load. Had NASA considered a multiple-CPU design, things could have been different.

The Spirit Rover has experienced a number of problems since then. The most recent reports indicate that Spirit has been largely inactive, with no data being received from it. The Opportunity Rover, on the other hand, continues to function successfully.

Windows Blue-Screens-Of-Death

When a problem in Windows forces the operating system to fail, the computer will often display an error screen, officially known as a "Stop error". The message describes exactly what went wrong, in white text on a blue background. Because of this color scheme, the screen has received the title "Blue Screen of Death", or BSoD for short. While BSoDs cover a much wider range of errors than race conditions alone, race conditions are among the more prominent causes.

The error 0x0000001A, MEMORY_MANAGEMENT, occurs due to a race condition in the system's memory management.

BSoDs have been experienced by almost anyone who has used operating systems such as Windows XP or Vista, but more critical systems, such as those in airports and ATMs, have also suffered them. A particularly public demonstration of the BSoD took place at the 2008 Beijing Summer Olympics in China, when one of the projector systems crashed during the opening ceremonies.

Conclusion

Obviously, a wide range of problems can occur due to race conditions. The difficulty is that any system could harbor a race condition bug, and it is impossible to simulate how software will react to every possible hardware setup it might be coupled with. The examples above show that life-critical and essential systems must be held to a higher standard of testing. The Therac-25 incident makes clear that legacy code should be thoroughly tested when implemented on a new system. The 2003 blackout shows that when installing a system in a new environment, users should put it through adequate testing of their own before relying on it in real-world applications. The Mars Rover incident shows that even without reused code or a changed environment, race conditions can go unnoticed, highlighting the fact that every possible combination of events should be tested when failure could cost millions of dollars. On the more common side of things, the BSoD shows that even software running on millions of computers can still contain deeply embedded bugs not yet fixed. So while race conditions cannot always be avoided, a higher standard of testing would greatly decrease the probability of one occurring.

References