COMP 3000 Essay 2 2010 Question 5

Paper

Title: Bypassing Races in Live Applications with Execution Filters

Authors: Jingyue Wu, Heming Cui, Junfeng Yang

Affiliations: Computer Science Department, Columbia University

Supplementary Information: Video available here as well as slides

Background Concepts

A race condition is a system flaw that “occurs when two threads access a shared variable at the same time." Race conditions can be very complex, time consuming and expensive to fix. Unfortunately, the most challenging part of race condition is not fixing it, but rather finding it. Race conditions are notorious for being extremely difficult to find, isolate and recreate. To help ease this process, the authors of this paper, Jingyue Wu, Heming Cui, Junfeng Yang, propose the adoption of LOOM.

LOOM is a system which dynamically locates areas which may be susceptible to race condition errors and allows the race condition to be potentially fixed. The power of LOOM rests in its ability to operate on live applications in real time. This is possible thanks to its evacuation algorithm which injects execution filters to fix race conditions at runtime. Execution filters, otherwise known as request filtering, allow you to inspect the request before and after the main logic is executed. By leveraging execution filters as the means for correcting race conditions, LOOM is able to operate with very little performance overhead and is a highly scalable as the number of application threads increases.

The authors tested LOOM on existing real world race conditions found in common applications. The tests found that all tested race conditions were solved, with little performance overhead, in a scalable and easy to implement manner.

This paper consists of multiple terms which must be familiar to the reader in order to assist in reading the Bypassing Races in Live Applications with Execution Filters paper. These terms are listed and explained below:

Deadlock: Deadlocks usually occur within the context of two threads. One thread tries to lock a variable that the other thread has already locked and vice versa. The result of this is that each thread is waiting for each other's thread to release the variable. Thus a deadlock occurs and nothing can happen.

Evacuation An algorithm for proactively pausing (and potentially changing) a state of code that will likely lead to a Race condition. One evacuates code from an unsafe location in memory so that one can install an execution filter there and prevent a race condition. Code is evacuated by either pausing it before it enters ab area or allowing it to quickly leave if it is already there.

Execution Filters: Otherwise known as request filtering. Execution Filters allow you inspect a request to use a block of data before and after any logic is executed on it. You can, in effect, filter what runs when, perhaps pausing one piece of code while another works on an area, preventing a race condition.

Function Quiescence The process of pausing and altering states, in order to avoid race conditions and overlapping between code segments.

Hot Patches: "Hot patching provides a mechanism to update system files without rebooting or stopping services and processes." [1]

Hybrid Instrumentation Engine: "Instrumentation refers to an ability to monitor or measure the level of a product's performance, to diagnose errors and writing trace information." [2] Instrument programs can have low runtime overhead, but instrumentation has to be done at compile time. Dynamic instrumentation can update programs at runtime but incur high overhead. A hybrid instrumentation is an implementation of combined static and dynamic instrumentation.

Lock: A lock is a way of limiting access to a common resource when using multiple threads. Lock and unlock methods are usually called at the beginning and end of a target method, respectively. It simply denies access when it is locked, and allows access when not locked. "Mutual exclusion locks (mutexes) are a common method of serializing thread execution. Mutual exclusion locks synchronize threads, usually by ensuring that only one thread at a time executes a critical section of code. Mutex locks can also preserve single-threaded code." [3]

Mutex: Mutex (Mutual Exclusion) is an algorithm which prevents race conditions on a resource. Essentially, it forces any threads that are trying to access the resource to wait until the current thread accessing the resource has completed using it.

Race Condition: "A race condition occurs when two threads access a shared variable at the same time." [4]

Semaphore: Semaphores are basically a special type of flag and generalize a down and up state (sleep or wakeup). The down operation checks to see if the value is greater than 0 and if so, decrements the value and uses up one stored wakeup. If the value is 0, the process is put to sleep. These steps are all done in a single indivisible atomic action. It is guaranteed that once a semaphore operation has started, no other process can access the semaphore until the operation has been completed or blocked. Semaphores are an essential part of solving synchronization problems. [5]

Research problem

Problem being addressed

With the rise of multiple core systems, multi-threaded programs are often prone to race conditions. Races are hard to detect, test and debug. Many systems are designed to detect, reproduce and diagnose race conditions, but these do not directly address the actual race. This is normally dealt with via a software update which would often require a software restart and potentially introduces new bugs. [7] Patches require you to be aware of the cause of the problem, and take time to produce, test, and install, leaving a potential user of the software waiting, potentially for months. Due to the immaturity of current race detectors, this paper explains a new approach to race detection and workarounds through the use of LOOM. The goal being to quickly address the actual symptom, the race condition, in real time on live systems. This is in opposition to the conventional approach which focuses on the cause of the race condition, an unknown bug in the software. By targeting the race condition itself, the system is able to keep running without a software update or even a reset.

Related work

Two common solutions to fixing deployed races are software updates and hot patches. Software updates require restarts whereas hot patches applies patches to live systems. However, relying on conventional patches can lead to new errors and could be unsafe, due to a multithreaded applications complexity. Releasing a reliable patch takes time, but developers often resort to more efficient fixes rather than placing proper locks in the application due to performance or work pressure.

Using a QUIESCE function to "temporarily suspend...incoming messages on an IUCV path" [6], these paths can later be reactivated and run as normal. This is not efficient for fixing a race condition because it only delays the problem in an attempt to avoid conflict. Although this does allow for a certain extent of safety it does not come near the reliability and flexibility of LOOM. Speed, reliability, flexibility and ease of use are all areas in which LOOM is demonstrated as being better than a QUIESCE function.

Another similar system to LOOM is STUMP [8]. STUMP is a system for releasing live updates for multi-threaded or single-threaded programs written in C. It has the ability to provide arbitrary patches to source code in running systems without requiring a reset. These patches require considerable annotation and preparation as source code modifications are considered to be unsafe. Unlike STUMP, LOOM does not operate on the source code and is considered more safe because of this.

The most recent system for live updates to the kernels of Operating Systems is called Ksplice [9]. It allows users to update the Linux kernel without resetting. This can be done either completely automatically if the code does not chance any data structures, or with on average 17 lines of code for an update that would otherwise require a reset. [9] It does this by operating on the object layer.

Contribution

Current solution expressed

Compared to traditional solutions, LOOM differs in its approach to race fixes. It is designed to quickly develop safe, optimized, temporary workarounds while a concrete solution is developed. LOOM is also very easy to use. LOOM is compiled with a developer's application as a plugin and kept separate from the source code. The plugin will inject the LOOM Update Engine into the application binary.

Mutual exclusion filters are written by the developer and synced with the source code to filter out any racy threads. The filter's code declaration is easy to understand and can be inserted in a code region that needs to be mutually exclusive. The developer does not need to deal with low level operations such as lock, unlock and semaphore operations. Users can then download the filter and apply it to the application while it is still live.

LOOM is flexible in that developers can make trade-offs between performance and reliability in their application. These can include making two code regions mutually exclusive when accessing different objects; or, with a significant decrease in performance, making them run in single threaded mode.

An evacuation algorithm is used for safety as to not introduce new errors. A critical region is marked using static analysis. All threads in the critical region are then evacuated. After the evacuation is executed, the execution filter is installed and then the threads are resumed after a live update pause is done at a safe location.

LOOM's hybrid instrumentation engine is used to reduce its overhead. The engine statically changes an application's binary to anticipate dynamic updates.

The evaluation of LOOM was based on overhead, scalability, reliability, availability and timeliness. These were demonstrated using Apache and MySQL in conjunction with the multithreaded ApacheBench and SysBench, respectively.

Through multiple tests the advantages of LOOM were proven. Overhead was tested in a comparison of LOOM during normal runtime. The effects of LOOM on Apache and MySQL were minimal, (~1.83% and ~4% respectively) causing it to be a viable as a runtime fix for race errors. To test scalability the team discovered that on 32 server threads, the overhead was still low: under 3% and 12% respectively. Reliability is one of the strongest facets of the LOOM system as it fixed all of the race conditions studied. To demonstrate LOOM's reliability, it was compared against a conventional restart-based software update. In this test the software update was clearly slower, requiring time to reset itself, where LOOM running a live update had almost no effect on the throughput. Lastly the timeliness of the installation of LOOM's fixes was demonstrated in a simple example. The LOOM based fix completed in 368ms whereas the function quiescence fix took the max test time (1 hour) and was not finished.

Why is it any better than what came before?

Previously, the two standard ways of fixing deployed race conditions were system updates and hot patches. LOOM is a superior choice to both these options for a number of reasons.

Unlike LOOM, the system update approach requires that the system be rebooted before the fix can be implemented. With desktop applications, rebooting a system is acceptable. However, servers often cannot reboot because requests are coming from external sources and are expected to be processed.

While hot patches do not require a system reboot, they do have their own specific vulnerabilities. Namely, it is very difficult to apply a patch that corrects the error, or errors, but leaves the rest of the system unaffected. Often when correcting a race condition via a hot patch, others can appear. The main concern with hot patches however, is that their development is a time consuming process. A process which until developed and deployed, leaves the race condition vulnerable and exposed. The paper chronicles a real world Mozilla race condition whose hot patch took nearly 8 years of development to correct. All the while the vulnerability was exposed to Mozilla users.

Flaws common to both system updates and hot patches are that they are very difficult to properly develop, slow to implement, and result in potentially unsafe ad hoc solutions that are not scalable.

Conversely, LOOM is easy to use, fast to implement, highly flexible, scalable, and safe to use.

Critique

Good

The authors of this essay effectively convey their findings by staying focused on the thesis as well as supporting their topics with relevant examples and data. Examples throughout the paper, particularly the MySQL example, ensure that the use of execution filters is clear to the reader. All of the examples are well documented and some (ex. Figure 2) are simplified to avoid confusing the reader with too much unnecessary information. The references throughout backup the reliability of the paper and let the reader to verify information from the sources.

The essay flows well and the information is delivered in a well put together order, allowing the reader to learn enough about LOOM (or any of the sub-topics involved in the explanation) before being informed about the next relative subject. The conclusion summarizes the paper in a clear and concise manner.

In terms of the technology that this paper serves to introduce, the offering is very strong. The authors make aggressive assertions stating the strengths of LOOM and back these assertions with unequivocal results through extensive testing. Specifically, the assertions made are that LOOM can dynamically locate and correct areas which may be susceptible to race conditions on live systems in real time, in a scalable, flexible, easy to use, safe to use, and fast to implement manor. The testing presented to support these assertions were nine real world race conditions of which each was successfully located and corrected, with little overhead.

Not-So-Good

One of the problems with this paper is that some of the examples are oversimplified. For example, Figure 9 attempts to represent the evacuation process. Unfortunately, this causes the problem to seem trivial.

The writers are also biased towards LOOM. Although they do admit the limitations of LOOM, they do not elaborate any further. They promote LOOM without discussing possible problems with it, such as the clients running LOOM may decide not to fix the race conditions and let the program continue to run with LOOM as a permanent fix. This may cause further errors in the long term life of the program.

References

[1] "Introduction to Hotpatching". Microsoft TechNet . Microsoft Corporation, 2010. Web. Accessed: Dec 1st 2010. <http://technet.microsoft.com/en-us/library/cc781109(WS.10).aspx>.

[2] "Introduction to Instrumentation and Tracing". MSDN . Microsoft Corporation, 2010. Web. Accessed: Dec 2nd 2010. <http://msdn.microsoft.com/en-us/library/aa983649(VS.71).aspx>

[3] Marshall, A. D. "Further Threads Programming:Synchronization.". Cardiff School of Comp. Sci. and Info. . Cardiff University, 1999. Web. Accessed: Dec 2nd 2010. <http://www.cs.cf.ac.uk/Dave/C/node31.html#SECTION003110000000000000000>

[4] "Description of race conditions and deadlocks". Microsoft Support . Microsoft Corporation, December 6, 2006. Revision: 2.3. Web. Accessed: Dec 2nd 2010. <http://support.microsoft.com/kb/317723>

[5] Tanenbaum, A. S. Modern Operating Systems (3rd Edition), page 128, 2008. Print.

[6] "QUIESCE Function." IBM IBM Corporation, 2008. Web. Accessed: Dec 2nd 2010. <http://publib.boulder.ibm.com/infocenter/zvm/v5r3/index.jsp?topic=/com.ibm.zvm.v53.hcpb4/hcse5b21270.htm>

[7] Lu, Shan; Park, Soyeon; Seo, Eunsoo; Zhou, Yuanyuan. "Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics". CiteSeerX . Dept. of Comp. Sci. at Univ. of Illinois, 2008. Web. Accessed Dec 2nd 2010. <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.121.1203>

[8] Neamtiu, Iulian; Hicks, Michael. "Safe and Timely Dynamic Updates for Multi-threaded Programs". ACM Digital Library. Association for Computing Machinery, 2009. Web. Accessed: Dec 2nd 2010. <http://portal.acm.org/citation.cfm?id=1542479>

[9] "Ksplice: Automatic Rebootless Kernel Updates". Ksplice. Massachusetts Institute of Technology, April 2009. Web. Accessed Dec 2nd 2010. <http://www.ksplice.com/doc/ksplice.pdf>