Revision as of 05:13, 14 October 2010

Question

What are some examples of notable systems that have failed due to flawed efforts at mutual exclusion and/or race conditions? How significant was the failure in each case?

Answer

Introduction

Race conditions have their fair share of notoriety in the history of software bugs. Their consequences range from a piece of Java code that halts an application, through corruption of web services, to failures of life-critical systems with fatal results. In this article, we define race conditions, examine some of the best-known cases involving them, and look at some of the solutions and tools the industry has proposed to track and detect them.

Overview

A race condition arises when two or more processes can access the same piece of data simultaneously and the end result depends on the relative timing of those processes. The result can be quite hazardous and can lead to major system failures. The main challenge with race-condition errors is that they are usually unpredictable and can be triggered in various ways depending on the processes involved and the surrounding environment, which makes them a nightmare for programmers to track down and debug.
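The dependence on timing can be made concrete with a small sketch. The Python below is not taken from any of the systems discussed in this article; the names increment and run are hypothetical. It models two processes, A and B, that each perform a non-atomic read-then-write on a shared counter, and it replays a chosen schedule step by step so that the outcome is reproducible.

```python
def increment(shared):
    """One 'process' doing a read-modify-write in two separate steps."""
    local = shared["counter"]        # step 1: read the shared value
    yield                            # the scheduler may switch here
    shared["counter"] = local + 1    # step 2: write back a possibly stale value
    yield


def run(schedule):
    """Replay a schedule such as "ABAB"; each letter runs one step of that process."""
    shared = {"counter": 0}
    procs = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        next(procs[name])
    return shared["counter"]


if __name__ == "__main__":
    print(run("AABB"))  # A finishes before B starts: prints 2
    print(run("ABAB"))  # both read 0 before either writes: prints 1
```

With the serial schedule both increments are counted. With the interleaved schedule one update is lost, even though each process ran exactly the same code.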

Examples

Therac-25

The Therac-25 was a computer-controlled radiation therapy machine developed in Canada by Atomic Energy of Canada Limited (AECL). Between 1985 and 1987, six patients were given overdoses of radiation by the machine, and half of these patients died as a result. The incident is quite possibly the most infamous software failure relating to race conditions: the cause of the accidents was traced back to a programming bug that produced a race condition.

The Therac-25 software was written by a single programmer in PDP-11 assembly language. Portions of the code were reused from the software of the earlier Therac-6 and Therac-20 machines.

The main portion of the code ran a task called “Treat”, which determined which of the program's eight main subroutines should be executing. The keyboard handler task ran concurrently with “Treat”.

Main Subroutines

The Therac-25 software made use of eight main subroutines. The Datent (data entry) subroutine had its own helper routine, Magnet, which prepared the machine's magnets to administer the correct dosage of radiation.

Reset
Datent
1. Magnet
Set Up Done
Set Up Test
Patient Treatment
Pause Treatment
Terminate Treatment
Date, Time, ID Changes

The Datent subroutine communicated with the keyboard handler task through a shared variable that signaled whether the operator had finished entering the necessary data. Once the flag was set, signifying that the operator had entered the necessary information, Datent allowed the main program to move on to the next subroutine. If the flag was not set, the “Treat” task rescheduled itself, which in turn rescheduled the Datent subroutine. This continued until the shared data-entry flag was set.

The Datent subroutine was also responsible for preparing the machine to administer the correct radiation dosage. Before returning control to “Treat” so that it could move on to the next of its eight subroutines, Datent first called the Magnet subroutine. This subroutine parsed the operator's input and moved the machine's magnets into position to administer the prescribed radiation. The Magnet subroutine took approximately 8 seconds to complete, and the keyboard handler kept running while it did. If the operator modified the data before the Magnet subroutine returned, the changes were not registered: the machine remained set up for the prior values, ignoring the operator's corrections.

Example Bug Situation

The situation below illustrates a chain of events that would result in an unintended dose of radiation being administered.

Operator types in the data and presses return.
(Magnet subroutine is initiated.)
Operator realizes there is an extra 0 in the radiation intensity field.
Operator quickly moves the cursor up, fixes the error and presses return again.
Magnets are set to the previous power level; the subroutine returns.
Program moves on to the next subroutine without registering the changes.
Patient is administered a lethal overdose of radiation.
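The chain of events above can be replayed with a toy model. The Python below is a deliberately simplified sketch and not the actual PDP-11 code; the function name run_session, the dose values and the recheck option are all hypothetical and chosen only for illustration. It shows how a setup routine that works from a snapshot of the operator's input can miss an edit made while it is still running.

```python
def run_session(edit_during_magnet_setup, recheck=False):
    """Toy model of the sequence described above (all values hypothetical)."""
    screen = {"dose": 2000}   # what the operator typed, with an extra 0

    # Operator presses return; the setup routine starts from a snapshot.
    snapshot = screen["dose"]

    # The slow magnet setup runs while the keyboard handler is still active.
    if edit_during_magnet_setup:
        screen["dose"] = 200  # operator corrects the entry in the meantime

    # Setup returns. Without a re-check, the stale snapshot is what gets used.
    if recheck and screen["dose"] != snapshot:
        snapshot = screen["dose"]  # illustrative remedy: re-read before moving on
    delivered = snapshot
    return screen["dose"], delivered


if __name__ == "__main__":
    print(run_session(edit_during_magnet_setup=True))                # (200, 2000)
    print(run_session(edit_during_magnet_setup=True, recheck=True))  # (200, 200)
```

In the first call the screen shows the corrected value while the machine is still set up for the old one, which is the mismatch the example describes. The recheck option only illustrates the general idea of re-validating shared data after a long operation; it is not a description of how the machine was actually fixed.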

Black-out of 2003

On August 14, 2003, a massive power outage spread through the Northeastern and Midwestern United States and Canada. A generating plant in Eastlake, Ohio went offline, causing a domino effect that ultimately led to over 100 power plants shutting down.

Several causes have been attributed to this massive failure. One of the most prominent was a software bug in General Electric Energy's Unix-based XA/21 energy management system.

FirstEnergy's control center in Akron, Ohio was responsible for monitoring the Eastlake plant. However, the software flaw left the control center unable to receive any warnings or alarms from the plants.

Because of this, the control center's ability to prevent the cascading effect after the Eastlake plant went offline was severely impaired.

The XA/21 bug was triggered by a unique combination of events and alarm conditions on the equipment it was monitoring. The main server failed, unable to handle the combination of requests. By the time the backup server kicked in, the events that had accumulated since the main server's failure caused it to go down as well.

The system gave no indication that it had failed, and the control center received no warning that it was operating without an alarm system.

The combination that caused the first failure arose when three sagging power lines tripped simultaneously. The three separate events attempted to operate on shared state at the same time, with the result that no alarm was raised and the system failed.

The NASA Mars-Rover

The NASA Mars-Rover incident is another well-known case of system failure due to race conditions. The Mars-Rover is a six-wheel-drive, four-wheel-steered vehicle designed by NASA to navigate the surface of Mars in order to gather videos, images, samples or any other possible data about the planet. NASA landed two Rover vehicles, Spirit and Opportunity, on January 4 and January 25, 2004, respectively. The Rovers were controlled on a daily basis by the NASA team on Earth, who sent them messages and tasks. Each solar day in the life of a Rover is called a Sol.

Hardware design and architecture

The vehicle's main operating equipment consists of a set of high-resolution cameras, a collection of specialized spectrometers and a set of radio antennas for transmitting and receiving data. The main computer was built around a BAE RAD-6000 CPU (Rad6k), RAM and non-volatile memory (a combination of FLASH and ROM).

Software design

The Rover software was mostly implemented in ANSI C, with some fragments of code written in C++ and assembly. The Rover relied on an autonomous system that enabled it to drive itself and carry out a number of self-maintenance operations. The software is time-multiplexed: all processes share and access resources on the single CPU. The Rover records its progress in three primary kinds of log: event reports (EVRs), engineering data (EH&A) and data products.

System failures and vulnerabilities

The first race-condition bug occurred on the Spirit Rover on Sol 131. The initialization module (IM) process was preparing to increment a counter that keeps track of the number of times an initialization has occurred. To do that, the IM process must request permission and be granted access to write that counter to memory (a critical section). While it was requesting permission, another process was granted access to that very same piece of memory. As a result, the IM process generated a fatal exception, which was reported through its EVR log. The exception led to loss of data and trouble in transmitting data to the NASA team on Earth, and eventually left the Rover in a halted state for a few days. The NASA team attempted to solve the problem by rebooting the Rover and restricting another module from operating during that time frame. However, the same bug recurred on the Spirit Rover on Sol 209 and then on the Opportunity Rover on Sol 596 and Sol 622.

A similar type of error occurred on Spirit on Sol 136, this time involving the Imaging Services Module (IMG). Just as the NASA team requested that data be transmitted from the Rover, the IMG was beginning to deactivate. The IMG's read cycles from memory were suddenly interrupted by the deactivation process, which was attempting to power off the piece of memory associated with the IMG reading task. This resulted in a failure to return the requested data from the Rover.

Windows Blue-Screens-Of-Death

When a problem in Windows forces the operating system to fail, the computer often displays an error screen, known as a Stop message, that describes the cause of the problem. Most people call this a Blue Screen of Death (BSOD).

The Stop error 0x0000001A, MEMORY_MANAGEMENT, can occur because of a race condition in memory management. It is also associated with hardware problems affecting memory; one possibility is that the memory does not receive enough power in time to serve the process.

The BSOD has surfaced on a number of Windows versions, including Windows 7. It has also caused system failures in airports, ATMs and street hoardings. However, the most notable public incident happened at the opening ceremony of the 2008 Beijing Summer Olympics in China, when one of the projectors crashed because of a BSOD.

Conclusions

The need to control race conditions while still allowing concurrency and safe sharing of resources among processes brings us to the concept of mutual exclusion (mutex). Mutual exclusion is the idea of making sure that processes access shared data in a serialized way. If process A, for instance, is executing the section of code that uses a particular shared data structure (called a critical section), then no other process, such as B, is allowed to execute a critical section that uses that same data structure until process A has finished with it. Common techniques used to establish mutual exclusion include locks, semaphores and monitors.

A handful of commercial software tools have also been developed to detect and address race-condition errors. More recently, a US software company by the name of ReplaySolutions was awarded a US patent for a toolkit for debugging race conditions in software.

As the industry strives for faster and more efficient performance through the use of multi-processor systems and multi-core chips, this area continues to be a vast field for research and innovation within the computing world.
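As a minimal sketch of mutual exclusion with a lock, the Python below guards a shared counter with threading.Lock from the standard library. The class SafeCounter and the function run_workers are hypothetical names introduced only for this illustration; they are not drawn from any system described above.

```python
import threading


class SafeCounter:
    """A counter whose increment is a critical section guarded by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time can hold the lock, so this
        # read-modify-write cannot interleave with another thread's.
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def run_workers(n_threads=4, n_increments=10_000):
    """Start several threads that all increment the same counter."""
    counter = SafeCounter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()


if __name__ == "__main__":
    print(run_workers())  # prints 40000
```

Because every increment happens while the lock is held, no update can be lost, and the final value always equals the number of threads times the number of increments. The cost is that the threads are serialized while they are inside the critical section.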

References

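Nancy Leveson. Medical Devices: The Therac-25. http://sunnyday.mit.edu/papers/therac.pdf
Reeves and Snyder. An Overview of the Mars Exploration Rovers' Flight Software. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1571113&userType=inst (another source: http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/37499/1/05-0539.pdf)
Matijevic and E. Dewell. 2006. Anomaly Recovery and the Mars Exploration Rovers. http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/39897/1/06-0922.pdf
Dreaded Blue Screen of Death strikes Olympics. http://news.cnet.com/8301-17938_105-10015872-1.html
Patent Awarded for Debugging Race Conditions. http://www.drdobbs.com/tools/225600068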

@@ Line 7: / Line 7: @@
 =Introduction=
-Race conditions bugs have their fare share of notoriety in the history of software bugs. This may range from a piece of Java code causing the application to halt, to life-critical system failures that lead to fatal results. In this article, we will define race conditions, examine some of the most well known cases involving race conditions and explore some of the solution schemes and ways the industry have proposed to track and detect race conditions.
+Race conditions have their fare share of notoriety in the history of software bugs. This may range from a piece of Java code causing the application to halt, to web services corruption to life-critical system failures that lead to fatal results. In this article, we will define race conditions, examine some of the most well known cases involving race conditions and take a look at some of the solution schemes and ways the industry have proposed to track and detect race conditions.
 =Overview=
@@ Line 13: / Line 13: @@
 Race conditions is the term used in situations where two or more processes can access the same piece of data simutaneously and
 the end result depends on the timing sequence of those processes. This end result can be quite hazardous leading to major system
-failures.
+failures. The main challenge with race condition errors is that they're usually unpredictable and can be triggered in
+various ways depending on the processes involved and the surrounding environment, making it a nightmare for
+the programmers to debug and track the error.
-The need to control those race conditions and maintain concurrency and safe sharing of resources among processes brings us to the concept of mutual exclusion (Mutex). Mutual exclusion is the idea of making sure processes access data in a serialized way. Meaning that, if process A for instance, happens to be executing or using a particular data structure (called a critical section), then no other process like B would be allowed to execute or use that very same data structure (critical section) until process A finishes executing or decides to leave the data structure. Common algorithms and techniques used to establish mutual exclusion include locks, semaphores and monitors.
 =Examples=
 == Therac-25 ==
-The Therac-25 was an x-ray machine developed in Canada by Atomic Energy of Canada Limited (AECL). The machine was used to treat people using radiation therapy. Between 1985 and 1987 six patients were given overdoses of radiation by the machine. Half these patients died due to the accident. The cause of the incidents has been traced back to a programming bug which caused a race-condition.
+The Therac-25 was an x-ray machine developed in Canada by Atomic Energy of Canada Limited (AECL). The machine was used to treat people using radiation therapy. Between 1985 and 1987 six patients were given overdoses of radiation by the machine. Half these patients died due to the accident. The incident is quite possibly the most infamous software bug relating to race conditions. The cause of the incidents has been traced back to a programming bug which caused a race-condition.
 The Therac-25 software was written by a single programmer in PDP-11 assembly language. Portions of code were reused from software in the previous Therac-6 and Therac-20 machines.
-The main portion of the code runs a function called “Treat” this function determins which of the programs 8 main subroutines it should be executing. The Keyboard handler task ran concurrently with “Treat”
+The main portion of the code runs a function called “Treat” this function determins which of the programs 8 main subroutines it should be executing. The Keyboard handler task ran concurrently with “Treat”.
 ===Main Subroutines===
@@ Line 75: / Line 76: @@
 == The NASA Mars-Rover ==
+The NASA Mars-Rover incident is another well known case of system failure due to race conditions. The Mars-Rover is a six wheeled driven, four wheeled steered vehicle designed by NASA to navigate the surface of Mars in order to gather videos, images, samples or and possible data about the planet. NASA landed two Rover vehicles, the Spirit and Opportunity Rovers, on January 4 and January 25, 2004, respectively. The Rover was controlled on a daily basis by the NASA team on earth by sending messages and tasks. Each solar day in the life of the Rover is called a Sol.
-The NASA Mars-Rover incident is another well known case of system failure due to race conditions. The Mars-Rover is a six wheeled driven, four wheeled steered vehicle designed by NASA to navigate the surface of Mars in order to gather videos, images, samples or any possible data about the planet. NASA landed two Rover vehicles, the Spirit and Opportunity Rovers, on January 4 and January 25, 2004, respectively. The Rover was controlled on a daily basis by the NASA team on earth by sending messages and tasks. Each solar day in the life of the Rover is called a Sol.
 ===Hardware design and architecture===
 The vehicle's main operating equipment consists of a set of high-resolution cameras, a collection of specialized spectrometers and a set of radio antennas for transmitting and receiving data. The main computer was built around a BAE RAD-6000 CPU (Rad6k), RAM and non-volatile memory (a combination of FLASH and ROM).
 ===Software design===
+The Rover software was mostly implemented in ANSI C, with some fragements of code written in C++ and assembly. The rover relied on an autonomous system that enabled it to drive itself and carry out a number of self-maintenance operations. The system implements a time-multiplexing system, where all processes share and access resources on the single CPU. The Rover records progress through the use of three primary log-file systems: event reports (EVRs), engineering data (EH&A) and data products.
-The Rover software was mostly implemented in ANSI C, with some fragements of code written in C++ and assembly. The rover relied on an autonomous system that enabled the rover to drive itself and carry out a number of self-maintenance operations. The system implements a time-multiplexing system, as all processes share resources on the CPU. The Rover records progress through the use of three primary log-file systems: event reports (EVRs), engineering data (EH&A) and data products.
+===System failures and vulnerabilities===
+The first race-condition bug occured in the Spirit Rover Sol 131. The initilazation module (IM) process was preparing to increment a counter that keeps track of the number of times an initilazation occured, in order to do that, the IM process must request permission and be granted access to write that counter to memory (critical section). While requesting the permission, another process was granted access to use that very same piece of memory (critical section). This resulted in the IM process generating a fatal exception through its EVR log. The exception lead to loss and trouble in transmitting data to the NASA team on earth, which eventually led to
-===System failures===
+the Rover being in a halt state for a few days. The NASA team attempted to solve the problem by rebooting the Rover and restricting another module from operating during that time-frame. However, the same bug reoccured in the Spirit Rover on Sol 209 and then on the Opportunity Rover on Sol 596 and Sol 622.
-The first race-condition bug occured in the Spirit Rover on Sol 131. The initilazation module (IM) process was preparing to increment a counter that keeps track of the number of times an initilazation occured, in order to do that, the IM process must request permission and be granted access to write that counter to memory (critical section). While requesting the permission, another process was granted access to use that very same piece of memory (critical section). This resulted in the IM process generating a fatal exception through its EVR log. The exception lead to loss and trouble in transmitting data to the NASA team on earth. The NASA team attempted to solve the problem by restricting another module from operating during that time-frame. However, the same bug reoccured in the Spirit Rover on Sol 209 and then on the Opportunity Rover on Sol 596 and Sol 622.
+A similar type of error occurred on the Spirit Sol 136, this time the Imaging Services Module (IMG) was involved. Just as the NASA team requested data from the Rover to be transmitted, the IMG was beginning a deactivation state, the IMG reading cycles from memory were suddenly interrupted by the deactivation process which was attempting to power off the piece of memory associated with the IMG reading task. This resulted in a failure to return from the Rover.
 ==Windows Blue-Screens-Of-Death==
@@ Line 100: / Line 99: @@
 =Conclusions=
+The need to control race conditions and maintain concurrency and safe sharing of resources among
+processes brings us to the concept of mutual exclusion (Mutex). Mutual exclusion is the idea of making sure
+processes access data in a serialized way. Meaning that, if process A for instance, happens to be executing or
+using a particular data structure (called a critical section), then no other process like B would be allowed
+to execute or use that very same data structure (critical section) until process A finishes executing or decides
+to leave the data structure. Common algorithms and techniques used to establish mutual exclusion include locks, semaphores and monitors.
+A handful of commercial software tools have been developed to address and detect race conditions errors as well. More recently, a US software company that goes by the name of ReplaySolutions has been awarded a patent from the US government for developing an innovative kit for debugging race conditions found in software.
+As the industry strives for faster and more efficient level of performance through the use of multi-processor systems and multi-core chips, this area continues to be a vast field for research and innovations within the computing world.
 =References=
 * Nancy Leveson. [http://sunnyday.mit.edu/papers/therac.pdf Medical Devices: The Therac-25]
-* Reeves and Snyder. [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1571113&userType=inst An Overview of the Mars Exploration Rovers' Flight Software]
+* Reeves and Snyder. [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1571113&userType=inst An Overview of the Mars Exploration Rovers' Flight Software],  [http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/37499/1/05-0539.pdf another source]
 * Matijevic and E. Dewell. 2006 [http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/39897/1/06-0922.pdf Anomaly Recovery and the Mars Exploration Rovers]
 * Dreaded Blue Screen of Death strikes Olympics [http://news.cnet.com/8301-17938_105-10015872-1.html]
+* Patent Awarded for Debugging Race Conditions [http://www.drdobbs.com/tools/225600068]