Talk:COMP 3000 Essay 1 2010 Question 6
Hey guys, this is Munther. I'm one of the members of the group assigned to this question. Before we start, let me just say that since this is a collective piece of work that's supposed to include contributions from each member of the group, let us all assume the role of editor. So we will all contribute and help edit the final version of the article.
Regarding our question: as a starting point, I figured it would be appropriate to start by defining what mutual exclusion (mutex) and race conditions mean. Let's start with race conditions, since mutual exclusion basically came to life because of the need to control race conditions.
Race conditions: situations where two or more processes are trying to read or write the same piece of shared data, and the final result depends on who runs precisely when. Look at the textbook, pages 117-118, for a detailed example of that.
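To make the "who runs precisely when" part concrete, here is a minimal sketch (not the textbook's example). It replays two fixed schedules by hand instead of using real threads, so the result is repeatable. The names `shared`, `run` and the bank-balance scenario are made up for illustration.

```python
# Hypothetical shared data: one balance that two "processes" both update.
shared = {"balance": 100}

def run(schedule):
    """Replay deposit steps for processes A and B in the given order.

    Each process takes two steps: first it reads the balance into a private
    copy, then it writes back that copy plus 50. `schedule` names which
    process takes its next step.
    """
    shared["balance"] = 100
    local = {}                                # private copy held by each process
    for proc in schedule:
        if proc not in local:                 # first step for this process: read
            local[proc] = shared["balance"]
        else:                                 # second step: write back copy + 50
            shared["balance"] = local[proc] + 50
    return shared["balance"]

serial = run(["A", "A", "B", "B"])            # A finishes before B starts
interleaved = run(["A", "B", "A", "B"])       # B reads before A has written
print(serial, interleaved)                    # prints "200 150"
```

In the interleaved schedule both processes read 100, so B's write overwrites A's deposit and one update is lost. Same code, different timing, different result.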
Mutual exclusion (mutex): the idea of making sure that processes access shared data in a serialized way. Meaning that if process A, for instance, happens to be executing the code that uses a particular shared data structure (that code is called a critical section), then no other process like B would be allowed to execute its own critical section on that very same data structure until process A finishes and leaves the critical section. Common techniques used to implement mutual exclusion include locks, semaphores and monitors.
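As a rough illustration of a lock guarding a critical section, here is a small sketch using Python's standard `threading.Lock`. The counter, the thread count and the name `add_many` are assumptions picked for the example. They do not come from any of the cases discussed here.

```python
import threading

counter = 0                          # shared data
counter_lock = threading.Lock()      # guards every access to `counter`

def add_many(times):
    """Increment the shared counter `times` times, one thread at a time."""
    global counter
    for _ in range(times):
        with counter_lock:           # enter the critical section
            counter += 1             # read-modify-write on shared data
        # the lock is released here, so another thread may enter

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                       # prints 40000
```

Because only one thread can hold `counter_lock` at a time, no increment is lost. Without the lock, the read-modify-write in `counter += 1` is not guaranteed to be atomic, so the final count could come out lower.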
Our question asks for examples of systems that have failed due to flawed efforts. For starters, this is a wiki-programming page (Rosetta Code) that examines race conditions and offers an example from the Unix/Linux operating systems. Whether the example mentioned there is considered a "failure" we should check with the prof. Anyways, it's a good starting point. http://rosettacode.org/wiki/Race_condition
Here's also a paper that goes back to 1992, which basically examines the excessive cost and resources used in older versions of the Unix system when implementing mutual exclusion. The paper goes on to explain the problem and offers a better solution. It's pretty easy to follow and understand, worth reading as well. http://www.usenix.org/publications/library/proceedings/sa92/moran.pdf
-- Munther --Hesperus 16:21, 11 October 2010 (UTC)
Hey, Andrew here, another member of this group. Those are some good starting points. The Wikipedia page on race conditions has references to a few good examples: http://en.wikipedia.org/wiki/Race_condition
A couple of notable ones:
The Therac-25 radiation therapy machine, which gave several patients massive radiation overdoses, some of them fatal http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Side_bar_1.html
A blackout in 2003 was caused by a race condition in one of the power company's alarm systems http://www.securityfocus.com/news/8412 (really awful block of text)
--Andrew
Alright, so the things that the prof mentioned in our last lecture proved to be super helpful. Basically, what he means by "systems" is any device-based operating system. It doesn't necessarily have to be a PC-based operating system (Windows, Linux, etc.). So the Therac-25 story mentioned by Andrew in the above post is a prime example of the type of thing we might be looking for.
Other notable examples:
1. The Opportunity Mars-Rover 1116 incident. (A rover is basically a space exploration vehicle designed to navigate the surface of a planet in order to gather images, samples or any possible information about that particular surface.) The rover experienced a rare, unexpected error due to a race condition. For some reason, this seems to be a fairly common problem for those Mars rovers, since the same kind of error was experienced on the Spirit Mars-Rover as well.
Here's an overview of the Opportunity 1116 incident from MarsToday: http://www.marstoday.com/news/viewsr.html?pid=23772
Here's a paper that examines the race conditions experienced on those rovers, discusses the Spirit rover incident and even goes on to explain the underlying architecture of the rover hardware: http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/39897/1/06-0922.pdf
2. A file-system-based type of race condition involves an older version of the Unix operating system, in which user-mode restrictions can actually be bypassed, allowing the user to access the entire system. I can see this being considered an error or a case of failure as well. This may actually be a bit more approachable, as far as understanding the Unix kernel and stuff like that goes. I'm sure we can find a lot of resources for this.
A small article exploring the issue: http://www.osdata.com/holistic/security/attacks/racecond.html
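The post above doesn't spell out the exact Unix bug, so the following is only a sketch of the general check-then-use pattern behind many file-system races. It assumes that is the kind of issue meant. The function names and the file name `report.txt` are hypothetical.

```python
import os
import tempfile

def create_racy(path):
    """Check, then use: the two steps are separate system calls."""
    if not os.path.exists(path):          # time of check
        # Another process could create `path` right here (for example as a
        # link to a sensitive file), and open() would then follow it.
        with open(path, "w") as f:        # time of use
            f.write("data")
        return True
    return False

def create_atomic(path):
    """Ask the OS to check and create in a single step."""
    try:
        # O_EXCL makes the call fail if the path already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write("data")
    return True

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "report.txt")
first = create_atomic(target)             # file does not exist yet
second = create_atomic(target)            # file exists now, so this is refused
print(first, second)                      # prints "True False"
```

The racy version leaves a window between the check and the open. The atomic version closes that window by letting the kernel do both in one call.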
- - - - - - - - - - -
Here's also a paper that examines race conditions in depth, talks about the importance of mutual exclusion and provides a number of solutions: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.5897&rep=rep1&type=pdf
Here's another paper from the ACM Portal: http://portal.acm.org/citation.cfm?id=130616.130623&coll=Portal&dl=GUIDE&CFID=104720795&CFTOKEN=13393160
If anyone can't access the PDF files on the ACM Portal, or even CiteSeer for that matter, you need to log in to the network using your Carleton library account. Go to the following: http://portal.acm.org.proxy.library.carleton.ca. You will be prompted to enter your student ID card barcode number; that's the number below your name on your student ID. The password is your CarletonCentral password.
I think so far we have managed to gather a good number of cases. In the next couple of days, we should probably delve deeper into some of those cases.
PS: If you wanna contact me, go to my profile in the history tab. Click on Hesperus.
-- Munther --Hesperus 16:21, 11 October 2010 (UTC)
Hey guys, I am Daniel. I am also in group 6 (am I the final group member?). I'm ready to help get this show on the road! I am going to set up a basic essay structure on the other page so that we know what to aim for. You guys look like you've rounded up quite a bit of info on the subject already, great job!
Introduction Paragraph: Introduces the question and gives some general background etc.
Paragraph 1: Gives first example in detail
Paragraph 2: Gives second example in detail
Paragraph 3: Gives third example in detail
Conclusion: Relates it all back together or something (never been good with conclusions)
I think each example paragraph should be broken down like this:
1. Introduction to the example
2. What they tried to use the Multi-Threading to do (or something like that)
3. Story of the system failing
4. The significance/involvement of race condition and mutual exclusion in the failure
5. Conclusion (how it was solved and stuff like that can go here too)
Dsont 03:05, 11 October 2010 (UTC) (this date is wrong for this edit)
Hey guys, I'm Fangchen. I am also in group 6. (So I might be the last member lol)
I found a chapter of a book from Sun; the chapter is called Race Conditions and Mutual Exclusion. There are some examples of race conditions in Java programming which I think we could study for sure.
The link to the book chapter is here:
http://java.sun.com/developer/Books/performance2/chap3.pdf
On page 2 of the PDF file, there is a first example of a race condition. I think this might be useful in our essay as a case study.
--Fangchen
My name is Julie and I believe that I am the last group member. Our professor said that every group has 5 to 6 members. It appears that we have quite the list of resources. Are we planning to use them all? It might be a good idea to list the resources we believe are the most relevant.
Note: This link, http://www.osdata.com/holistic/security/attacks/racecond.html, is broken.
I only have one resource to add. I found a paper that summarizes information about Therac-25 and the blackout of 2003: http://x4.6times7.org/downloads/software_catastrophes.pdf.
4.1 Blackout (pg. 5 – 6)
4.3 Therac-25 (pg. 7 – 8)
I think we should agree on a thesis soon. Currently the examples in our essay are not connected by a central argument. If we have time, I think we should try to find another example (assuming we have agreed to write about Therac-25, the blackout of 2003 and the Mars rovers). Prof. Anil said that he was expecting four to five examples. Three examples is a minimum. I have been trying to search for one that is not as well known (as encouraged in class) but I have not had any luck.
Are the series of Mars rovers (Opportunity and Spirit from 2004-2005) the most recent examples? I have not found any that are more recent so far. I wonder if systems programmers have learned from these past failures. I noticed, while searching for resources, that researchers have developed/are now developing tools and strategies to detect race conditions.
Lastly, what is our plan for dividing the work for this essay? Also, do we want to meet in person someday?
--J powers 16:08, 9 October 2010 (UTC)
One suggestion I have for dividing the work is for everyone to write a paragraph of the essay or about a specific disaster. --J powers 16:50, 9 October 2010 (UTC)
Cool, it's good to have the other members of the group on board. I will handle the editing and the introductory paragraph. I will try to make it as academic as possible.
What Julie mentioned is right. The prof said that 3 examples are alright, but he's really looking for 4-5 cases. We need to impress him a little bit here. The other case he mentioned was the Blue-Screens-Of-Death incidents. I believe a mailman was killed because of that. I will try to find some information on that later on today.
Also, if you guys wanna meet up a couple of days before the due date, that's OK by me. We can meet up in the Herzberg labs on the 4th floor, not the undergrad ones, the ones at the end of the hall. Or I can reserve a room for us in the library. Or we can just continue doing this online if you prefer; I know that each one of us probably has a different schedule and everything.
-- Munther --Hesperus 16:21, 11 October 2010 (UTC)
Alright, seems we needed more than I originally thought :p so I tweaked the other page to have 5 of them instead of 3. I would absolutely like to meet up :D. Doing this online thing makes me feel weird for some reason...
But if we do meet up, let's put all our discussion and decisions on the page here so it can get reviewed etc.
If we are gonna meet up I would prefer Herzberg (not that it really matters, it's just where I hang out anyways).
Also, is this due on Tuesday or Thursday?
Dsont 03:06, 11 October 2010 (UTC) this date is wrong for this edit
Started using tildes now thanks julie
--- Ok everyone write in here when you are available before the 14th
Daniel: all day Monday, Tuesday, and Thursday
Munther: --
Fangchen: --
Andrew: After 12:30 Tues-Wed-Thurs
Julie: Tuesday after 2:30, and Wednesday/Thursday after 1:00 J powers 19:32, 10 October 2010 (UTC)
cha0s: monday in the afternoon, tuesday after 1, and all day wednesday
Hey everyone. Awesome, looks like we have a lot of information and resources here to work from. Daniel's template structure looks good and we should follow that. We should come up with a plan for executing this: what topics we want to cover and who would like to focus on what. I think the 3 big examples we've found lots of resources for are the Therac-25, the Mars rover and the blackout. The professor mentioned he'd like to see some more exotic examples, so let's try and find some for examples 4/5.
Layout we can build on.
Introduction
Therac-25
Mars Rover
Blackout
Example 4
Example 5
Conclusion
I'm going to try and read up a bit more on the Therac-25 and put in a few paragraphs today.
Atubman 21:55, 10 October 2010 (UTC) (did not know about the 4 tildes thing, thanks for sharing)
I do not mind which topic I write about but I feel a personal connection with the blackout. My hometown was affected for a long time and there were concerns about chemical plants nearby. Therefore I have an interest in writing/researching about it.
Has the group member above (Could you please put your name? Was it Andrew?) decided on Therac-25 then?
Also, I have noticed that not everyone has been using 4 tildes. I am not sure if this is how the professor knows who wrote what, but it would not hurt to use them (less to type as well).
Any ideas on a deadline for all of our writing?
J powers 21:05, 10 October 2010 (UTC)
I tried writing up a bit about the Therac-25. Still pretty rough, but it's a start.
Good information in this paper http://sunnyday.mit.edu/papers/therac.pdf
Pages 22-28 deal with the software bug
Atubman 23:27, 10 October 2010 (UTC)
Yo, I'm guessing I'm the last member, putting us at 6. I'll post what I've got for my section later tonight. I'm good to meet monday in the afternoon, tuesday after 1, and all day wednesday.
cha0s 20:00, 10 October 2010 (EDT)
Looks like Tuesday is a good day. Should we wait for the rest to confirm? Dsont 03:08, 11 October 2010 (UTC)
Yo, after looking around a bit, it seems like it might be better to just cover three topics in greater depth, as the three we currently have come with a lot of documentation. This will also demonstrate our ability to work together more than each of us doing a separate paragraph would.
cha0s 3:02, 11 October 2010 (EDT)
Hey guys. Like I mentioned before, I will handle the editing, introductory paragraph, conclusions and the Mars-Rover incidents case. In the meantime, I strongly urge other members of the group to look into the Blackout case and try to find us another case like the Blue-Screens-of-Death which the prof mentioned in class. Most of the cases I found were all software related. Nothing major. So it would be great to have someone help with the research. We will try as much as possible to deliver 4 cases.
-- Munther --Hesperus 16:21, 11 October 2010 (UTC)
I've been looking for a while now, and I can't find any major system failures related to the topic except the three we already have. I'll focus my research on the blackout case for now.
cha0s 16:34, 11 October 2010 (EDT)
Posted a rough section for the 2003 blackout. Will add citations and contribute to the Therac-25 section later tonight. If anyone has found a fourth topic, post it and I'll try and find some more info on it.
cha0s 18:54, 11 October 2010 (EDT)
Hey guys. I've edited the article and provided an introduction and an overview piece. Plus, I've posted the first part of the Mars-Rover incident. This is just a rough version. The article of course needs further editing. I will keep editing and updating the Mars-Rover case in the next 24 hours. I also started a section for the Blue-Screens-Of-Death incidents. I don't think there's any harm in doing that. I've found that this was a fairly common problem in some versions of Windows, leading to a handful of system failures in airports and on electronic hoardings; it even happened at the Beijing Summer Olympics of 2008! So this could be a potential case as well. I will try to consult the prof regarding this today; he might provide us with some hints or crucial talking points.
Munther --Hesperus 06:20, 12 October 2010 (UTC)
I guess I'll do Blue Screens then
Dsont 13:36, 12 October 2010 (UTC)
Ok, so in today's lecture, Thomas (cha0s) inquired about the essay and the prof mentioned that three cases would be enough. But if we wanna go fancy, a fourth case might be a good idea. I think it would be a lot better if we focus on the three cases at hand and leave the Blue-Screens-of-Death to the end. The prof also talked about plagiarism and emphasized the need to be original. Even if we cite the resources, the article itself has to be original in the sense that it conveys the writer's own understanding. So no copying and pasting will be tolerated. In fact, I'm going back to the Mars-Rover incident to do a re-edit and make sure there's no direct phrasing or imitation of style. He suggested that it would be a good idea to read and understand the article, then put it away and try to phrase and deliver the concepts and notions in one's own words. It would be OK to use the exact scientific terms, though. There's no escaping that, I guess.
Munther --Hesperus 14:35, 12 October 2010 (UTC)
Hey, If you guys want more things to talk about, the Linux kernel has suffered many a race condition failure leading to security vulnerabilities that allow root / kernel level access. I remember one from a while ago that hit Slashdot where a local user could cause a race condition that caused a null pointer (a pointer that's essentially set to 0x00000000) to be dereferenced resulting in the kernel trying to execute at address 0. Now if you stick your own code at 0, you can now run your own code in the kernel ;)
--3maisons 19:19, 12 October 2010 (UTC)
Hey guys, I saw that we might be lacking documentation on the Blue Screen of Death. I found this book, which explains how the blue screen problem occurs: http://books.google.com/books?hl=zh-CN&lr=&id=2bGxMzOtUMsC&oi=fnd&pg=PR15&dq=Blue-Screens-of-Death&ots=aYecJYK84q&sig=vXttqNmGEONz3K8Txt3PkLsJze4#v=onepage&q=Blue-Screens-of-Death&f=false
Page 54 describes the reason why it happens.
And here is an example of how blue screens affect people's lives. I think this book might be useful since it is related to software performance.
BTW, I'll be available the whole afternoon tomorrow.
---Fangchen
The only explanation of the BSOD I have found is that error 0x0000001A occurs because of a race condition in memory usage, but there is no further explanation. Has anyone found something on that?
---Fangchen 21:40, 14 October 2010
Yo, I'll be at Herzberg around 12-12:30 tomorrow if you guys want to meet up.
--cha0s 3:40, 13 October 2010
I'm currently having office hours in HP 1175 from 10 am - 12 pm. I will try to drop by the labs on the third and fourth floors to meet up with cha0s. Anyways, I will be finishing the Mars-Rovers part today and I will re-edit the overview and the introduction as well. Other members of the group should probably help with the Therac-25; that case is supposed to be the most important one in the whole essay.
Munther --Hesperus 14:01, 13 October 2010 (UTC)
Just re-edited the Mars Rover and BSOD sections (just added a few examples to the incident, didn't alter the main content). Provided resources as well.
Munther --Hesperus 15:44, 13 October 2010 (UTC)
I'm in the lounge right now.
--cha0s 11:57, 13 October 2010 (UTC)
Sorry dude. I had to leave. Best chance for us is to meet tomorrow after the lecture. Like I mentioned before, I will make sure that the Mars-Rover section is finished today. cha0s is doing the blackout. I don't think there's much to add to the BSOD. Atubman wrote the first blurb about the Therac-25; if you could go back and refine it a little bit and provide the resources, that would be great. Other members should help as well. I'll try to do the conclusions today if I can. I'm also thinking about seeing the prof tomorrow in his office hours; he might give us some tips as far as presenting the cases and all.
Munther --Hesperus 18:44, 13 October 2010 (UTC)
Sorry I have not been participating lately. I had a group presentation today but now I am free to work on this essay. I will gladly meet after class tomorrow and help until 3007. After 3007, I can work for the rest of the day. Tonight I will try to read about Therac-25 and write more in that section. I also have ideas to contribute to the blackout section.
J powers 21:02, 13 October 2010 (UTC)
Hey guys. Just did another edit. The Rover case is now finished. I can also see that Atubman refined the Therac-25 case. I added a single line to that section, again, I didn't alter the main content at all.
Wrote a little something for the conclusions and moved the mutual exclusion paragraph from the overview to the conclusions, since we didn't really talk about any mutual exclusion techniques or solutions throughout the cases, so why mention them there? However, having them in the conclusions section at the end is a bit jarring I guess, because we're introducing this whole concept at the end of the article. Also, the resources used throughout the article must be mentioned in the resources section.
If anyone wants to help with the editing as far as grammar or vocab goes, please do so. I will be seeing the prof in his office hours tomorrow, if anyone wants to join me, that would be great. After our lecture, I have a class from 11:30 to 1:00 pm and then another one from 4:30 pm to 5:30 pm, in case you guys wanna meet up.
I think we're pretty much set to go. The prof wanted three cases, we did four, so this has to mean something.
Munther --Hesperus 05:34, 14 October 2010 (UTC)
I am currently in HP4115 if anyone is around. Or is everyone meeting somewhere else? Munther, I can come with you after 3007 to talk to Anil. I need to ask him about what I am planning to contribute.
J powers 14:24, 14 October 2010 (UTC)
Hey Julie. Yeah, I'm definitely seeing the prof today at 1:00 pm, so I'll see you there. I think the essay is pretty much done; we just need to refine the conclusion a little bit, and that's what I'm planning on asking him about. Also, guys, please add the resources that were used. We don't wanna get into any trouble.
Also, I'm currently thinking of some potential questions that we might add to the end of the essay, like the prof suggested today. Here are some ideas:
- What is the main idea behind race condition errors?
Answer: more like a definition.
- What are some of the techniques used to establish mutual exclusion and how do they work?
Answer: locks, semaphores, busy waiting & monitors. Refer to the textbook for the details.
- How do Windows and Linux differ in terms of handling race conditions and applying mutual exclusion?
Answer: I honestly have no idea, but I'm pretty sure Linux uses semaphores. I will discuss this with the prof today.
- What are the mechanisms that Linux uses to apply mutual exclusion (or even synchronization for that matter)?
Answer: Semaphores, pipes, signals. Processes can generate signals to notify other processes that a specific event is occurring in a particular data structure.
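For the techniques question, a counting semaphore can be sketched in a few lines. This uses Python's standard `threading.Semaphore`, not the Linux kernel's own primitives, so it only illustrates the idea. The limit of 2, the names `worker`, `slots`, `peak` and the short sleep are assumptions chosen for the example.

```python
import threading
import time

slots = threading.Semaphore(2)       # at most 2 threads may hold a slot at once
state_lock = threading.Lock()        # protects the bookkeeping counters below
inside = 0                           # threads currently holding a slot
peak = 0                             # highest value `inside` ever reached
finished = 0                         # threads that completed their work

def worker():
    global inside, peak, finished
    with slots:                      # "down"/acquire: blocks while 2 are inside
        with state_lock:
            inside += 1
            peak = max(peak, inside)
        time.sleep(0.01)             # stand-in for work on the shared resource
        with state_lock:
            inside -= 1
            finished += 1
    # "up"/release happens when the `with slots` block exits

threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak, finished)                # peak never exceeds 2; finished is 6
```

A semaphore initialized to 1 behaves like a plain lock. A larger initial value lets a fixed number of threads in at once, which is the main difference from a mutex.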
I might add this section today prior to midnight if I end up with some potential talking points. I will also edit the overview and the conclusion.
Munther --Hesperus 14:48, 14 October 2010 (UTC)
I am working on revising at the moment. I read through and revised the introduction.
The first question is fine but I do not see how the last two (possibly three; we do talk about techniques and Windows briefly) questions relate to our essay specifically. They relate more to the classroom material. Maybe we should have something like "Describe (at least? or three?) two famous system failures caused by race conditions. Why did they occur and what were the consequences of their failures?".
J powers 15:12, 14 October 2010 (UTC)
I'm going to see the prof right now. Yeah, the questions somehow relate more to the class material.
Munther --Hesperus 16:58, 14 October 2010 (UTC)
I'll be on later tonight. I'll expand the blackout section and contribute anything I find to the other sections then.
--cha0s 14:24, 14 October 2010 (UTC)
I'm in the library, 4th floor, near the computers if anyone wants to join me. If you're on the lower floors, just post something here and I'll come down to see you. I'll be here for the next 2 or 3 hours.
Munther --Hesperus 18:28, 14 October 2010 (UTC)
Julie and I are on the 4th floor of the Herzberg labs; it's the graduate lab at the end of the hall. We will be here for the next 3 or 4 hours.
Munther --Hesperus 18:52, 14 October 2010 (UTC)
Thesis
Everyone, we need to agree on a thesis ASAP. Our cases are not connected. The professor told us to look for patterns that are common to each case. We should incorporate these into each section and form a thesis around them as well. J powers 18:58, 14 October 2010 (UTC)
Common:
- Unexpected cases (infrequent occurrences and hard to duplicate conditions that caused the failure)
- Inability to test for all real-life situations (before release)
- Type of programming language (C/C++, Assembly)
- No ideas about the root of the failure (each case required varied amounts of time to find it)
- At least 1 recurrence (except for the blackout)
- Human error (especially in Therac-25 and the blackout; preventable)
- Race conditions are a common problem
- Software Design (poor)
J powers 20:14, 14 October 2010 (UTC)