COMP 3000 Essay 2 2010 Question 6
Paper
Effective Data-Race Detection for the Kernel
Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf
Video: http://homeostasis.scs.carleton.ca/osdi/video/erickson.mp4
Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research
Background Concepts
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.
A data race is a potentially catastrophic event which can be alarmingly common in modern concurrent systems. When one thread attempts to read or write on a memory location at the same time that another thread is writing on the same location, there exists a potential data race condition. If the race is not handled properly, it could have a wide range of negative consequences. In the best case, there might be data corruption rendering the affected files unreadable and useless; this may not be a major problem if there exist archived, non-corrupted versions of the data. In the worst case, a process (possibly even the operating system itself) may freak out and crash, unable to decide what to do about the unexpected input it receives.
Traditional data-race detection programs operate by running an isolated runtime and comparing it with the currently active runtime, to find situations that would have resulted in a data race if the runtimes were not isolated. DataCollider operates by temporarily setting up breakpoints at random memory access instances. If a certain memory access hits a breakpoint, DataCollider springs into action. The breakpoint causes the memory access instruction to be postponed, and so the instruction pretty much goes to sleep until DataCollider has finished its job. The job is like taking a before and after photograph of something; DataCollider records the data stored at the address the instruction was attempting to access, then allows the instruction to execute. Then DataCollider records the data again. If the before and after records do not match, then another thread has tampered with the data at the same time that this instruction was trying to read it; this is precisely the definition of a data race.
[Don't worry guys; that's not all I've got. I'm still working on it.]
--Austin Bondio 01:56, 2 December 2010 (UTC)
Research problem
What is the research problem being addressed by the paper? How does this problem relate to past related work?
The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.
The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.
One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.
Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.
Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.
Contribution
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)
Proving that there is a problem with classic solutions to race detectors:
The main contribution that DataCollider provides is the unique idea of using hardware breakpoints in a data race detector. The question is why is a unique idea necessary. Why does DataCollider have to "reinvent the wheel". There has been a plethora of race condition testers invented in the last two decades, and almost all of the dynamic data race detectors can be lumped into three categories. They either implement lock-set, happens-before, or a hybrid of the two types of detection. The research team for DataCollider looked at several of these implementations of race condition testers to find ways of improving their own program, and found that there are major problems in the classic ways of detecting race conditions.
Some of the programs that were referenced were:
- Eraser: A Dynamic Data Race Detector for Multithreaded Programs
- RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking
- PACER: Proportional Detection of Data Races
- LiteRace: Effective Sampling for Lightweight Data-Race Detection
- MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs
Eraser: A Dynamic Data Race Detector for Multithreaded Programs[1]
lock-set based reasoning
Eraser, a data race detector programmed in 1997, was one of the earlier data race detectors invented. It may have been a useful and revolutionary program of its time, however, it uses very low level techniques compared to most data race detectors today. One of the reason why it is unsuccessful is because it only checks whether memory accesses use proper locking techniques. If a memory access is found that does not use a lock, then Eraser will report a data race. In many cases, the misuse of proper locking techniques is a conscious decision by the programmer, so Eraser will report many false positives. Modern locking systems are also very complicated and have several different kinds of locks for different situations. It is difficult for one program to handle upwards of 12 types of locks, especially when they are very complicated. This does not take into account all of the benign problems such as date of access variables. Locking systems are notorious for reporting false positives such as this, and it is near impossible to change the architecture of the algorithm to ignore benign cases.
PACER: Proportional Detection of Data Races[2]
happens-before reasoning
Pacer, a happens-before data race detector, uses the FastTrack algorithm to detect data races. FastTrack uses vector-clocks to keep track of two potentially conflicting threads. If the two threads conflict, a data race is thrown, and the state of the program is saved. Pacer samples a percentage of each memory access, (from 1 to 3 percent) and runs the FastTrack happens-before algorithm on each thread that accesses that part of memory. Similar to Pacer, DataCollider samples a percentage of the program's memory accesses, but instead of using vector-clocks to catch the second thread, hardware breakpoints are used. Pacer runs with an overhead of approximatively one to three times the speed of the original program because it takes a considerable amount of processing power to maintain the vector-clocks. Hardware break points are considerably faster than vector-clocks, and as a consequence, DataCollider runs with less overhead than Pacer.
LiteRace: Effective Sampling for Lightweight Data-Race Detection[3]
happens-before reasoning
LiteRace, similar to Pacer, samples a percentage of memory accesses from a program. Where it differs is the parts of memory that LiteRace samples the most. The "hot spot" regions of memory are ones that are accessed most by the program. Since they are accessed the most, chances are that they have already been successfully debugged, or if there are data races there, they are benign. LiteRace detects these areas in memory as hot spots, and samples them at a much lower rate. This improves LiteRace's chances of capturing a valid data race at a much lower sampling rate. Where DataCollider bests LiteRace is based on LiteRace's installing mechanism. LiteRace needs to be recompiled into the software it is trying to debug, whereas DataColleder's breakpoints do not require any code changes to the program. This is a major success for DataCollider because often third party testers do not have the source code for a program.
RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Trackings[4]
combination of lock-set and happens-before reasoning
RaceTrack uses a unique technique in order to detect data races. The program being debugged is run on top of RaceTrack as a virtual machine using the .NET framework, and it will examine all of the memory accesses that the program requests. As soon as suspicious behavior is exhibited, a warning is sent off to be later evaluated when the program terminates. RaceTrack uses this technique because several process intensive inspections of the state of the machine must be checked, and doing this on the fly is expensive. There are many problems with RaceTrack. It is very successful at detecting a vast percentage of data races, however, it has a high overhead and requires extreme amounts of memory. RaceTrack must save the state of the entire machine every time a warning is produced, and it also has to save each threads memory accesses to check which memory access "happened before". Since most warnings thrown are found to be benign, saving the state of the machine wastes computational power and memory. Long running programs also prove to be a problem, where the computer being debugged will run out of memory to store all of the warning states before the program terminates. It then will have to either increase overhead significantly to store the warnings on disk, or it will have to delete some warnings to make room for new ones.
MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs[5]
combination of lock-set and happens-before reasoning
MultiRace is another hybrid style race condition debugger that uses two unique algorithms. The first algorithm, Djit is the happens-before iteration, which INSERT STUFF HERE. The second is an improved iteration of the lock-set algorithm. MultiRace is the most similar program to DataCollider in terms of their goals. Both strive to decrease overhead to near standard running times of the program itself, and to increase the program transparency for maximum user compatibility. MultiRace itself is several orders of magnitude more complicated than DataCollider, but since MultiRace hides its complexity from the user with transparency, it is still simple to use. It is arguable that MultiRace is superior for detecting races for C++ programs, however, MultiRace is not compatible with any other programming language. Since DataCollider uses hardware breakpoints, the coding language of the program is irrelevant. Also, since DataCollider avoids using both lock-set and happens before algorithms, it is versatile enough to even debug kernels.
DataCollider is a very unique program. Most other dynamic race condition testers can be lumped into the three groups lock-set, happens-before, or hybrid. DataCollider, however, recognizes the errors of these styles of detection, and manages to avoid them completely. Even though there are issues with false positives and benign races, DataCollider provides very simple, versatile, and lightweight functionality in debugging a program. Future programs may take this unique style of race detection and add their own functionality to improve upon it. It could be that DataCollider could inspire a ground breaking solution to race conditions and how to detect them.
Critique
Style
This paper is well put together. It has a strong flow and there is nothing that seems out of place. The authors start with an introduction and then immediately identify key definitions that are used throughout the paper. In the second section which follows the introduction the authors identify the definition of a Data-Race as it relates to their paper. This is important since it is a key concept that is required to understand the entire paper. This definition is required because as the authors state there is no standard for exactly how to define a data-race.[1] In addition to important definitions any background information that is relevant to this paper is presented at the beginning. The key idea which the paper is based on in this case Data Collider and its implementation is explained. An evaluation and conclusion of Data Collider follow its description. The order of the sections makes sense and the author is not jumping around from one concept to another. The organization of the sections and information provided make the paper easy to follow and understand.
Content
Data Collider:
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors.[1] The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed.[1] This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.
The overhead of any application running is very important to all users. The developers of DataCollider ran various tests to determine the overhead of running DataCollider based on the number of breakpoints. These results were included in the final paper. DataCollider has a low overall base overhead and it is only after 1000 breakpoints a second does the run time overhead increase drastically.[1] This adds to the effectiveness of DataCollider. Having a low overhead is very important to use of an application.
References
[1] Erickson, Musuvathi, Burchhardt, Olynyk, Effective Data-Race Detection for the Kernel, Microsoft Research, 2010.PDF