<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Spanke</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Spanke"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Spanke"/>
	<updated>2026-05-01T17:08:55Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_11&amp;diff=6196</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_11&amp;diff=6196"/>
		<updated>2010-12-02T05:21:02Z</updated>

		<summary type="html">&lt;p&gt;Spanke: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;--[[User:Spanke|Spanke]] 00:19, 2 December 2010 (UTC) Finished Timers, I hate 3004...&lt;br /&gt;
&lt;br /&gt;
--[[User:Jjpwilso|Jjpwilso]] 00:03, 2 December 2010 (UTC) I&#039;ll check for some more references on the ACM and IEEE databases. In the meantime I thought I&#039;d mention what Anil said regarding critique. He suggested we should consider other approaches to the same solution, such as modifying NTP with a different heuristic. I&#039;ll see what I can dig up in other papers on NTP.&lt;br /&gt;
&lt;br /&gt;
--[[User:ScottG|ScottG]] 22:06, 1 December 2010 (UTC) I&#039;m assuming you meant for me to add my references, yes? I really only used the article, and &#039;Timekeeping in Virtual Machines&#039; which I went to add, but is already on there. I&#039;ve looked for other articles to try to get how others have looked at it that aren&#039;t VMware, but there really isn&#039;t a huge amount out there dealing &#039;&#039;specifically&#039;&#039; with guest timekeeping (unless I&#039;ve gone Google-blind, which has admittedly happened before). Mostly I ran into links pointing to that specific article.&lt;br /&gt;
&lt;br /&gt;
--[[User:Sblais2|Sblais2]] 17:50, 1 December 2010 (UTC) I added stuff into the Research problems. I think I summarized most of them. If I forgot any, please add them in. I also added the missing references in the reference section. For Fedor, we seem to be missing some content in 2 sections. Also, you could read through the other sections and add/change any pertinent information that might&#039;ve been missed and would make this essay even better.&lt;br /&gt;
&lt;br /&gt;
--[[User:Sblais2|Sblais2]] 15:11, 1 December 2010 (UTC) Would it be possible to add your references at the bottom please? Even if it is a link. I have added the article link at the top of the essay.&lt;br /&gt;
&lt;br /&gt;
Hey guys, sorry it&#039;s taken me a while to post here. If there is a particular topic that needs researching, I could spend some hours doing that tomorrow - suggestions? Also, I intend to fix up the style &amp;amp; structure after everything is done, as I am quite good at that. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;br /&gt;
&lt;br /&gt;
--[[User:ScottG|ScottG]] 21:36, 26 November 2010 (UTC) So I was a little (more than a little) behind on my initially estimated time for getting stuff up on Guest Timekeeping, but that&#039;s the gist of it there now. I&#039;m going to try to buff it up a bit before it&#039;s due, since what I put in is a bit rougher than I&#039;d like. If I seem to be missing something that should be pretty obvious, let me know and I&#039;ll work it in.&lt;br /&gt;
&lt;br /&gt;
--[[User:Jjpwilso|Jjpwilso]] 15:49, 23 November 2010 (UTC) I&#039;ve been completely swamped with COMP3004 stuff (among other things) and feeling guilty as hell about this essay. The good news, for those who might have missed today&#039;s lecture, is we have an extension of one week. Phew!!&lt;br /&gt;
&lt;br /&gt;
--[[User:Sblais2|Sblais2]] 21:29, 22 November 2010 (UTC) I have added a small part to the background section. I have created a diagram by hand explaining how it works. I tried to find an original way of doing it, but it is the same diagram everywhere. Please feel free to comment here or by sending me an email.&lt;br /&gt;
&lt;br /&gt;
--[[User:AbsMechanik|AbsMechanik]] 19:46, 22 November 2010 (UTC) Here&#039;s what my research has led me to so far. I&#039;m trying to come up with good points for the research problem, contribution and critique part of this essay. Here&#039;s a bunch of links, I&#039;ve come across. I think there will be a few more tonight. Feel free to read through &#039;em: &lt;br /&gt;
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
http://www.xen.org/files/xen_interface.pdf&lt;br /&gt;
http://www.microsoft.com/whdc/system/sysinternals/mm-timer.mspx&lt;br /&gt;
http://www.intel.com/hardwaredesign/hpetspec_1.pdf&lt;br /&gt;
http://www.cubinlab.ee.unimelb.edu.au/radclock/&lt;br /&gt;
&lt;br /&gt;
--[[User:ScottG|ScottG]] 18:55, 22 November 2010 (UTC) I&#039;m good taking the Guest Timekeeping section. Hopefully I&#039;ll have some stuff up tonight or early tomorrow for it.&lt;br /&gt;
&lt;br /&gt;
--[[User:Sblais2|Sblais2]] 17:14, 22 November 2010 (UTC) I will be working on the Background section. I will dedicate it to explaining some of the key concepts used in the research paper, which will allow readers to have a better understanding of the rest of our essay. The structure you&#039;ve put in place looks good, but it might get modified depending on how the text flows. The diagram is a good idea. I will draw a simple one and add it in. Feel free again to critique.&lt;br /&gt;
&lt;br /&gt;
--[[User:Jjpwilso|Jjpwilso]] 15:12, 16 November 2010 (UTC) I wanted to get a structure started, so I have stubbed out the first section. Note: some of the sub-sections might belong in the Research Problem section but we can easily move them if they fit there. Let&#039;s use this area to plan who is doing what. Feel free to critique any of my submissions. When you comment here, please put your comments at the very top so we can easily see recent posts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Participants=&lt;br /&gt;
(X) Blais   Sylvain sblais2 - Email: syl20blais@gmail.com&lt;br /&gt;
(X) Graham  Scott   sgraham6&lt;br /&gt;
(X) Ilitchev Fedor  filitche - Email: fedor dot ilitchev at gmail dot com&lt;br /&gt;
(X) Panke   Shane   spanke&lt;br /&gt;
(X) Shukla  Abhinav ashukla2&lt;br /&gt;
(X) Wilson  Robert  jjpwilso&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6190</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6190"/>
		<updated>2010-12-02T04:59:39Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Virtualize Everything But Time =&lt;br /&gt;
The article was written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darryl Veitch of the Centre for Ultra-Broadband Information Networks (CUBIN), Department of Electrical &amp;amp; Electronic Engineering, University of Melbourne, Australia. Here is the link to the article: [http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provide you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.&lt;br /&gt;
&lt;br /&gt;
The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client last got the CPU&#039;s attention? Even at the best of times there&#039;s no way for it to guess, because it wasn&#039;t actually running. If the VM was suspended and migrated from one physical host to another, its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is important to have a good grasp of the general concepts broached in it. For example, we all know what clocks are in our day-to-day life, but what are they in the context of computing? In this section, we describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, and synchronization algorithms, and we explain what a para-virtualized system is.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
For thousands of years, people have tried to find better ways to keep track of time. From sundials to atomic clocks, all were made for the specific purpose of measuring the passage of time. Computer operating systems are not so different. Timekeeping is typically done in one of two ways: tick counting and tickless timekeeping[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]. In tick counting, the operating system programs a hardware device to interrupt the CPU at a known rate. Each time one of those interrupts fires (a tick), the operating system records it in a counter, which tells the system how much time has passed. In tickless timekeeping, instead of the OS tracking time through interrupts, a hardware device runs its own counter from the moment the system boots, and the OS simply reads that counter when needed. Tickless timekeeping seems the better way to keep track of time because it doesn&#039;t hog the CPU with hardware interrupts; however, its performance is very dependent on the type of hardware used. Another disadvantage is that such counters tend to drift, causing inaccuracy; these drifts are explained later. But both of these are just counters: they don&#039;t know the actual real-world time. To remedy that, a computer either gets its time from a battery-backed real-time clock or queries a network time server via the Network Time Protocol (NTP). The computer can also use software, in the form of a daemon that runs periodically, to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
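To make the tick-counting approach concrete, here is a minimal sketch in C (illustrative only; the HZ value and function names are assumptions, not code from the paper). An interrupt handler increments a counter, and elapsed time is derived from the count:&lt;br /&gt;
&lt;br /&gt;
 /* Tick counting sketch: a timer interrupt fires HZ times per second. */&lt;br /&gt;
 #define HZ 1000                        /* assumed interrupt rate        */&lt;br /&gt;
 static volatile unsigned long ticks;   /* one increment per interrupt   */&lt;br /&gt;
 &lt;br /&gt;
 void timer_tick(void) { ticks++; }     /* called on every timer tick    */&lt;br /&gt;
 &lt;br /&gt;
 /* Elapsed time since boot in milliseconds: each tick is 1000/HZ ms.   */&lt;br /&gt;
 unsigned long uptime_ms(void) { return ticks * (1000 / HZ); }&lt;br /&gt;
&lt;br /&gt;
Under tickless timekeeping, timer_tick() would disappear entirely and the OS would instead read a free-running hardware counter on demand.&lt;br /&gt;
&lt;br /&gt;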
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based, or even a hybrid of the two. The most common timer is the hardware timer. Hardware timers can generally be described by the following diagram, though some have more or fewer features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency, which the operating system may have to measure when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, it generates an output signal that may interrupt the CPU. That same interrupt then causes the counter&#039;s initial value to be reloaded into the counter, and the process begins again. Not all hardware timers work exactly like this: some actually count up, others don&#039;t use interrupts, and yet others don&#039;t keep an initial value. The general principle of hardware counters is, however, the same. There is some kind of fixed interval, at the end of which the current time is updated by an appropriate number of units (e.g. nanoseconds).&lt;br /&gt;
&lt;br /&gt;
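As a rough model of the diagram (a sketch under assumed names, not code from the paper), the counter decrements once per oscillator cycle and reloads itself when it reaches zero:&lt;br /&gt;
&lt;br /&gt;
 /* Down-counter with auto-reload, modelling the abstraction above.    */&lt;br /&gt;
 struct timer {&lt;br /&gt;
     unsigned long reload;   /* initial value, set by software         */&lt;br /&gt;
     unsigned long count;    /* current value; starts equal to reload  */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Advance one oscillator cycle; returns 1 when the output fires.     */&lt;br /&gt;
 int timer_cycle(struct timer *t) {&lt;br /&gt;
     if (--t-&amp;gt;count == 0) {&lt;br /&gt;
         t-&amp;gt;count = t-&amp;gt;reload;  /* reload and begin again        */&lt;br /&gt;
         return 1;           /* output signal, e.g. a CPU interrupt    */&lt;br /&gt;
     }&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Count-up timers invert the comparison, but the principle is the same: a fixed interval, then a time update.&lt;br /&gt;
&lt;br /&gt;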
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# The CMOS RTC is a real-time clock on the CMOS chip, which the CMOS battery keeps powered so it can track things like time even while the physical PC has no source of power. If there were no CMOS battery on the motherboard, the computer would reset to its default time on each restart. The battery itself can die, as expected, if the computer is powered off and unused for a long period of time. This can cause issues for the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI establishes industry-standard interfaces for OS-directed device configuration and power management. It is an industry standard through its creators: Intel, Microsoft, Phoenix, Hewlett-Packard and Toshiba. Its power management covers all form factors: notebooks, desktops, and servers. ACPI&#039;s goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS, as well as the VM, to have control over power management. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]&lt;br /&gt;
# RDTSC is an instruction introduced with the x86 P5 (Pentium) that performs high-resolution timing by reading the processor&#039;s cycle counter; however, it suffers from several flaws (a read sketch follows this list). Discontinuous values can be returned when successive reads do not execute on the same processor, as happens when a thread migrates between the cores of a multicore processor. This is made worse by ACPI power management, which can eventually leave the cores completely out of sync. There is also the question of dedicated hardware: &amp;quot;RDTSC locks the timing information that the application requests to the processor&#039;s cycle counter.&amp;quot; With dedicated timing devices included on modern motherboards, this method of locking the timing information will become obsolete. Lastly, the variability of the CPU&#039;s frequency needs to be taken into account. On modern laptops, CPU frequencies are adjusted on the fly, rising to meet the user&#039;s demand and lowering when idle; this results in longer battery life and less heat, but regretfully makes RDTSC unreliable. [http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx]&lt;br /&gt;
# HPET defines a set of timers that the OS has access to and can assign to applications. Each timer can generate an interrupt when its comparator&#039;s least significant bits equal the equivalent bits of the 64-bit counter value. However, a race condition can occur in which the target time has already passed, causing extra interrupts and extra work even for a simple task. It does produce fewer interrupts than its predecessors, the PIT and CMOS RTC, giving it an edge. Despite its race condition, this modern timer is an improvement upon older practices. [http://hackipedia.org/Hardware/HPET,%20High%20Performance%20Event%20Timer/IA-PC%20HPET%20%28High%20Precision%20Event%20Timers%29%20Specification.pdf]&lt;br /&gt;
&lt;br /&gt;
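As a hedged illustration of how the TSC behind RDTSC (item 5) is typically read, here is the common GCC inline-assembly idiom (not code from the paper):&lt;br /&gt;
&lt;br /&gt;
 /* Read the 64-bit time-stamp counter with the RDTSC instruction.     */&lt;br /&gt;
 static unsigned long long read_tsc(void) {&lt;br /&gt;
     unsigned int lo, hi;&lt;br /&gt;
     __asm__ __volatile__(&amp;quot;rdtsc&amp;quot; : &amp;quot;=a&amp;quot;(lo), &amp;quot;=d&amp;quot;(hi));&lt;br /&gt;
     /* EDX:EAX hold the count; combine them as hi * 2^32 + lo.        */&lt;br /&gt;
     return (unsigned long long)hi * 4294967296ULL + lo;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As the drawbacks above suggest, two such reads are only safely comparable when they execute on the same core at a stable frequency.&lt;br /&gt;
&lt;br /&gt;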
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping uses the same general methods as any computer timekeeping: either tick counting or a tickless system. Where the two begin to differ, however, is that a host operating system can communicate directly with the physical hardware, while the guest operating system cannot; it must ask the host to communicate with the hardware on its behalf. This indirection is the greatest source of the guest operating system&#039;s clock losing accuracy, more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when resuming from a suspended state or restoring from a snapshot – so it is easy to think that, since it starts off correct, the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply the drift the host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time spent communicating with the counter, even on the host system. Because the guest relies on the host to keep track of its time, an error in the host&#039;s time is passed on to the guest; moreover, while the host is busy correcting its own time, the guest&#039;s request for a count is given slightly lower priority, making the guest lose yet more accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it the virtual environment can be denied the CPU time it requires, or allocated less memory than it needs. With restricted CPU time, it&#039;s easy for requested ticks or requested counter reads to pile up into a backlog, or for the requested data to arrive late enough to throw the clock off. With memory, if the virtual environment does not have enough allocated to it by the host, it can end up swapping out pages that are needed soon. Swapping the pages back in momentarily brings the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, consider a factory&#039;s assembly line: the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, a specific machine might move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle. One result could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, then when it tries to correct its time it could give more or less time than intended to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on its cause. If the problem is due to CPU management issues, the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out pages so often.&lt;br /&gt;
&lt;br /&gt;
If the issue comes from neither of those, but simply from the inevitable lag when the guest communicates with the hardware via the host, there are other methods to correct the drift. Most systems have algorithms built in natively to correct the time if it gets too far ahead of or behind real time, though these are not without their own faults: if the time is set ahead while catching up, the backlog of ticks that has built up may not be cleared, so the clock could set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks whether the clock&#039;s error is within a certain margin. If it exceeds the margin, the backlog is set to zero – preventing the issue just mentioned with the native algorithms – and the clock resynchronizes with the host before the guest goes back to keeping track of time as it normally would.&lt;br /&gt;
&lt;br /&gt;
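A sketch of the clamp-and-resync behaviour just described (the margin value and helper names are assumptions for illustration; this is not VMware Tools code):&lt;br /&gt;
&lt;br /&gt;
 /* Drift compensation sketch: resync when the error exceeds a margin. */&lt;br /&gt;
 #define MAX_ERROR_US 100000L             /* assumed margin: 100 ms    */&lt;br /&gt;
 &lt;br /&gt;
 extern long guest_clock_error_us(void);  /* guest minus host time     */&lt;br /&gt;
 extern void clear_tick_backlog(void);    /* drop pending ticks        */&lt;br /&gt;
 extern void resync_with_host(void);      /* jump to the host clock    */&lt;br /&gt;
 &lt;br /&gt;
 void compensate_drift(void) {&lt;br /&gt;
     long err = guest_clock_error_us();&lt;br /&gt;
     if (err &amp;gt; MAX_ERROR_US || err &amp;lt; -MAX_ERROR_US) {&lt;br /&gt;
         clear_tick_backlog(); /* avoid overshooting while catching up */&lt;br /&gt;
         resync_with_host();&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;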
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Today, the Network Time Protocol and daemons like ntpd are the dominant solution for accurate timekeeping. In optimal conditions ntpd can be very good, but those conditions rarely hold. Network congestion, disconnections, lower-quality networking hardware and unexpected system events can create offset errors on the order of 10 or even 100 milliseconds (ms). [http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf]&lt;br /&gt;
For demanding applications, this is neither robust nor reliable. One way to enhance the performance of ntpd would be to poll the NTP server more often, as this would reduce the offset error; unfortunately, it would also increase network traffic, which could cause congestion, which in turn would raise the offset error. So this won&#039;t work. &lt;br /&gt;
&lt;br /&gt;
Another problem with current system software clocks using NTP (like ntpd) is that they provide only an absolute clock.[http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf]&lt;br /&gt;
This makes them unsuitable for applications that deal with network management and measurement. Why? Because NTP focuses on offset and not on the hardware clock&#039;s oscillator rate. For example, when calculating delay variations, the offset error cancels out of the calculation, but variation in the clock&#039;s oscillator rate does affect it, as the sketch below illustrates. So a clock with a stable rate would make those calculations more precise, which means we would need another kind of system software clock.&lt;br /&gt;
&lt;br /&gt;
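A small sketch of the distinction (parameter names are assumptions, in the spirit of a feed-forward clock such as RADclock, not its actual code): an absolute clock suffers from both offset and rate errors, while a difference clock cancels the offset term entirely:&lt;br /&gt;
&lt;br /&gt;
 /* Feed-forward clock sketch built on a raw counter (names assumed).  */&lt;br /&gt;
 struct ffclock {&lt;br /&gt;
     double period;  /* estimated seconds per counter unit             */&lt;br /&gt;
     double origin;  /* estimated absolute time at counter value zero  */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Absolute time: affected by errors in both origin and period.       */&lt;br /&gt;
 double abs_time(const struct ffclock *c, unsigned long long count) {&lt;br /&gt;
     return c-&amp;gt;origin + c-&amp;gt;period * (double)count;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Time difference: the origin cancels, so only the rate error        */&lt;br /&gt;
 /* remains, which is what delay-variation measurements care about.    */&lt;br /&gt;
 double diff_time(const struct ffclock *c,&lt;br /&gt;
                  unsigned long long c1, unsigned long long c2) {&lt;br /&gt;
     return c-&amp;gt;period * (double)(c2 - c1);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;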
In virtualization (in this case Xen), migrating a running system from one physical host to another can cause issues, and this is again caused by the ntpd daemon. By default, each guest OS runs its own instance of ntpd. The synchronization algorithm keeps track of the reference wallclock time, the rate of drift and the current clock error, all of which are defined by the hardware clock of the system it runs on. When the virtualized OS is migrated to another system, the ntpd state is saved, and when it is enabled again on the new system, that&#039;s where the problems start. Because no two hardware clocks drift the same way or have the exact same wallclock time, all the information tracked by the daemon is suddenly inaccurate. This could prove disastrous to the system, ranging from a slowly recoverable error to one from which ntpd might never recover, making the virtualized OS unstable.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
(sections are stubs for the moment ... more to come)&lt;br /&gt;
The contributions of this paper were:&lt;br /&gt;
&lt;br /&gt;
* baseline evaluations of:&lt;br /&gt;
** performance of NTP in dependent and independent configurations&lt;br /&gt;
** Xen Clocksource as a basis counter under NTP&lt;br /&gt;
** latencies of different clock sources&lt;br /&gt;
** implications of Power Management&lt;br /&gt;
&lt;br /&gt;
* new architecture&lt;br /&gt;
** RADclock&lt;br /&gt;
** XenStore as holder of clock parameter data&lt;br /&gt;
** feed-forward versus feedback&lt;br /&gt;
&lt;br /&gt;
* evaluation of RADclock vs ntpd&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. &amp;quot;Virtualize Everything But Time&amp;quot; by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darryl Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf&lt;br /&gt;
&lt;br /&gt;
2. &amp;quot;Timekeeping in Virtual Machines, Information Guide&amp;quot; from VMware. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
3. &amp;quot;Bran&#039;s Kernel Development Tutorial&amp;quot; from the Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm&lt;br /&gt;
&lt;br /&gt;
4. &amp;quot;What is a CMOS battery, and why does my computer need one?&amp;quot; from Indiana University&#039;s Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html&lt;br /&gt;
&lt;br /&gt;
5. &amp;quot;Multiprocessor Specification version 1.4&amp;quot; from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf&lt;br /&gt;
&lt;br /&gt;
6. &amp;quot;PC Based Precision Timing Without GPS&amp;quot; by Attila Pásztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf&lt;br /&gt;
&lt;br /&gt;
7. &amp;quot;Robust Synchronization of Absolute and Difference Clocks over Networks&amp;quot; by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf&lt;br /&gt;
&lt;br /&gt;
8. Broomhead, T.; Ridoux, J.; Veitch, D.; , &amp;quot;Counter availability and characteristics for feed-forward based synchronization,&amp;quot; Precision Clock Synchronization for Measurement, Control and Communication, 2009. ISPCS 2009. International Symposium on , vol., no., pp.1-6, 12-16 Oct. 2009&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6169</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6169"/>
		<updated>2010-12-02T04:05:55Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Virtualize Everything But Time =&lt;br /&gt;
Article written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch. They are working for the Center for Ultra-Broadband Information Networks (CUBIN) Department of Electrical &amp;amp; Electronic Engineering at the University of Melbourne in Australia. Here is the link to the article: [http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provides you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.&lt;br /&gt;
&lt;br /&gt;
The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client got the CPU&#039;s attention? At the best of times there&#039;s no way for it to guess because it wasn&#039;t actually running. If the VM was suspended and migrated from one physical host to another its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
For thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency. The operating system might have to measure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. Not all hardware timers work exactly like that. For instance, some actually count up, others don&#039;t use interrupts, and yet others don&#039;t keep an initial counter. The general principle of hardware counters is the however the same. There is some kind of fixed interval at the end of which the current time is updated by an appropriate number of units (i.e. nanoseconds).&lt;br /&gt;
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI establishes industry-standard interfaces configuration guided by the OS and power management. Power Management includes notebooks, desktops, and servers. ACPI&#039;s goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS as well as the VM to have control over power management. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]&lt;br /&gt;
# RDTSC is based on the x86 P5 instruction set and perform high-resolution timing, however, it suffers from several flaws. Discontinuous values from the processor are caused as a result of not using the same thread to the processor each time, which can also be caused by having a multicore processor. This is made worse by ACPI which will eventually lead to the cores being completely out of sync. Availability of dedicated hardware: &amp;quot;RDTSC locks the timing information that the application requests to the processor&#039;s cycle counter.&amp;quot; With dedicated timing devices included on modern motherboards this method of locking the timing information will become obsolete. Lastly, the variability of the CPU&#039;s frequency needs to be taken into account. With modern day laptops, most CPU frequencies are adjusted on the fly to respond to the users demand when needed and to lower themselves when idle, this results in longer battery life and less heat generated by the laptop but regretfully affects RDTSC making it unreliable. [http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx]&lt;br /&gt;
# HPET (I will fill these in later, Dec. 1st Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system&#039;s clock losing accuracy, or more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host&#039;s time is not only passed on to the guest, but because the host is trying to correct its own time the guest&#039;s request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it&#039;s easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory&#039;s assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on the cause of it. If the problem is due to CPU management issues, then the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a  memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out page files so often.&lt;br /&gt;
&lt;br /&gt;
If the issue is from neither of those, but simply due to the inevitable lag when the guest communicates with the hardware via the host, then there are other methods to correct the drift. Most systems natively have algorithms built in to correct the time if it gets too far ahead or behind real time, though they are not without their own faults; if the time is set ahead when catching up, the backlog of ticks it has built up may not be cleared, so it could potentially set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks to see if the clock&#039;s error is within a certain margin. If it exceeds the margin, then the backlog is set to zero – to prevent the issue mentioned with the native algorithms – and resynchronizes with the host clock before the guest goes back to keeping track of time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Today, the use of the Network Time Protocol and of daemons like ntpd is the dominant solution for accurate timekeeping. In optimal conditions, the ntpd can be very good but these situations rarely happen. Network congestion, disconnections, lower quality networking hardware and unsuspected system events can create offsets errors in the order of 10‘s or even 100 milliseconds(ms). [http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf]&lt;br /&gt;
For demanding applications, this is neither robust or reliable. One way to enhance the performance of ntpd would be to poll from the NTP server more often as this would reduced the offset error but unfortunately, this would increase the network traffic which could cause network congestions which would raise the offset error. So this won’t work. &lt;br /&gt;
&lt;br /&gt;
Another problem with current system software clocks using NTP(like ntpd), is that they provide only an absolute clock.[http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf]&lt;br /&gt;
So for applications that deals with network managements and measurements, this is unsuitable. Why? Because NTP focus on offset and not on hardware clock oscillator rate. For example, when calculating delay variations, the offset error doesn’t change anything to the calculations but the clocks’ oscillator rate variation does affect it. So having a more accurate timestamp would make those calculation more precise. Which mean we would need another system software clock.&lt;br /&gt;
&lt;br /&gt;
In virtualization(in this case Xen), when migrating a running system from one system to another can cause issues and this is again caused by the ntpd daemon. By default, each guest OS runs its own instance of the ntpd daemon. So the synchronization algorithm keeps track of the reference wallclock time, rate-of-drift and current clock error, which are defined by the hardware clock on the system. So when migrating the virtualized OS to another system, the ntpd state is saved and when it is enabled again on the new system, thats where the problems starts. Because no two hardware clocks drifts the same way or have the exact same wallclock time, all the information traced by the daemon are all of a sudden inaccurate. This could prove disastrous to the system. This could go from a slowly recoverable error to one where ntpd might never recover, making the virtualized OS unstable.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
(sections are stubs for the moment ... more to come)&lt;br /&gt;
The contributions of this paper were:&lt;br /&gt;
&lt;br /&gt;
* baseline evaluations of:&lt;br /&gt;
** performance of NTP in dependent and independent configurations&lt;br /&gt;
** Xen Clocksource as a basis counter under NTP&lt;br /&gt;
** latencies of different clock sources&lt;br /&gt;
** implications of Power Management&lt;br /&gt;
&lt;br /&gt;
* new architecture&lt;br /&gt;
** RADclock&lt;br /&gt;
** XenStore as holder of clock parameter data&lt;br /&gt;
** feed-forward versus feedback&lt;br /&gt;
&lt;br /&gt;
* evaluation of RADclock vs ntpd&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. &amp;quot;Virtualize Everything But Time&amp;quot; by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf&lt;br /&gt;
&lt;br /&gt;
2. &amp;quot;Timekeeping in Virtual Machines, Information Guide&amp;quot; from VMWare. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
3. &amp;quot;Bran&#039;s Kernel Development Tutorial&amp;quot; from Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm  &lt;br /&gt;
&lt;br /&gt;
4. &amp;quot;What is a CMOS battery, and why does my computer need one?&amp;quot; from the Indiana University&#039;s Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html&lt;br /&gt;
&lt;br /&gt;
5. &amp;quot;Multiprocessor Specification version 1.4&amp;quot; from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf&lt;br /&gt;
&lt;br /&gt;
6. &amp;quot;PC Based Precision Timing Without GPS&amp;quot; by Attila Pa ́sztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf&lt;br /&gt;
&lt;br /&gt;
7. &amp;quot;Robust Synchronization of Absolute and Difference Clocks over Networks&amp;quot; by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6145</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6145"/>
		<updated>2010-12-02T03:36:46Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timekeeping */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Virtualize Everything But Time =&lt;br /&gt;
Article written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch. They are working for the Center for Ultra-Broadband Information Networks (CUBIN) Department of Electrical &amp;amp; Electronic Engineering at the University of Melbourne in Australia. Here is the link to the article: [http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provides you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.&lt;br /&gt;
&lt;br /&gt;
The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client got the CPU&#039;s attention? At the best of times there&#039;s no way for it to guess because it wasn&#039;t actually running. If the VM was suspended and migrated from one physical host to another its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
For thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency. The operating system might have to measure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. Not all hardware timers work exactly like that. For instance, some actually count up, others don&#039;t use interrupts, and yet others don&#039;t keep an initial counter. The general principle of hardware counters is the however the same. There is some kind of fixed interval at the end of which the current time is updated by an appropriate number of units (i.e. nanoseconds).&lt;br /&gt;
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI establishes industry-standard interfaces configuration guided by the OS and power management. Power Management includes notebooks, desktops, and servers. ACPI&#039;s goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS as well as the VM to have control over power management. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]&lt;br /&gt;
# TSC&lt;br /&gt;
# HPET (I will fill these in later, Dec. 1st Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system&#039;s clock losing accuracy, or more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host&#039;s time is not only passed on to the guest, but because the host is trying to correct its own time the guest&#039;s request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it&#039;s easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory&#039;s assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on the cause of it. If the problem is due to CPU management issues, then the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a  memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out page files so often.&lt;br /&gt;
&lt;br /&gt;
If the issue is from neither of those, but simply due to the inevitable lag when the guest communicates with the hardware via the host, then there are other methods to correct the drift. Most systems natively have algorithms built in to correct the time if it gets too far ahead or behind real time, though they are not without their own faults; if the time is set ahead when catching up, the backlog of ticks it has built up may not be cleared, so it could potentially set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks to see if the clock&#039;s error is within a certain margin. If it exceeds the margin, then the backlog is set to zero – to prevent the issue mentioned with the native algorithms – and resynchronizes with the host clock before the guest goes back to keeping track of time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Today, the Network Time Protocol (NTP) and daemons like ntpd are the dominant solution for accurate timekeeping. Under optimal conditions ntpd can perform very well, but such conditions are rare: network congestion, disconnections, lower-quality networking hardware and unexpected system events can create offset errors on the order of tens or even a hundred milliseconds (ms). [http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf]&lt;br /&gt;
For demanding applications, this is neither robust nor reliable. One way to improve ntpd&#039;s performance would be to poll the NTP server more often, which would reduce the offset error; unfortunately, it would also increase network traffic, which could cause congestion and raise the offset error again. Simply polling faster therefore does not work.&lt;br /&gt;
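&lt;br /&gt;
For reference, this trade-off is exposed directly in ntpd&#039;s configuration: the minpoll and maxpoll options set the polling interval as powers of two seconds. A sketch of an ntp.conf excerpt that polls aggressively (the server name is a placeholder):&lt;br /&gt;
&lt;pre&gt;
# /etc/ntp.conf (excerpt); ntp.example.org is a placeholder
# minpoll/maxpoll are exponents: 2^4 = 16 s up to 2^6 = 64 s.
# Lower values shrink the offset error between polls, but every
# client polling this fast adds traffic and invites congestion.
server ntp.example.org iburst minpoll 4 maxpoll 6
&lt;/pre&gt;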
&lt;br /&gt;
Another problem with current system software clocks disciplined by NTP (like ntpd) is that they provide only an absolute clock.[http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf]&lt;br /&gt;
This makes them unsuitable for applications that deal with network management and measurement. Why? Because NTP focuses on offset, not on the rate of the hardware clock&#039;s oscillator. When calculating delay variations, for example, a constant offset error cancels out of the calculation entirely, while variation in the oscillator rate does affect it. A clock with a stable, accurate rate would make such calculations more precise, which means we would need a different kind of system software clock: a difference clock.&lt;br /&gt;
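&lt;br /&gt;
To see why the offset cancels but the rate error does not, model the software clock as C(t) = (1 + gamma)t + theta, where theta is the absolute offset and gamma the rate error. A measured interval is then C(t2) - C(t1) = (1 + gamma)(t2 - t1): theta has vanished, gamma has not. A tiny C illustration with made-up numbers:&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Clock model: C(t) = (1 + gamma) * t + theta.
   theta = absolute offset (s), gamma = rate error (dimensionless). */
static double read_clock(double t, double gamma, double theta) {
    return (1.0 + gamma) * t + theta;
}

int main(void) {
    double theta = 0.050;             /* 50 ms offset: large, yet harmless */
    double gamma = 50e-6;             /* 50 ppm rate error                 */
    double t1 = 100.0, t2 = 100.010;  /* a true 10 ms interval             */

    double measured = read_clock(t2, gamma, theta)
                    - read_clock(t1, gamma, theta);
    /* theta cancels in the subtraction; only gamma skews the interval. */
    printf(&amp;quot;true: %.9f s, measured: %.9f s, error: %.3g s\n&amp;quot;,
           t2 - t1, measured, measured - (t2 - t1));
    return 0;
}
&lt;/pre&gt;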
&lt;br /&gt;
In virtualization (here, Xen), migrating a running guest from one physical machine to another can also cause problems, and again ntpd is the culprit. By default, each guest OS runs its own instance of the ntpd daemon, whose synchronization algorithm maintains state – a reference wallclock time, a rate of drift and a current clock error – that is defined by the hardware clock of the machine it runs on. When the virtualized OS is migrated, this ntpd state travels with it, and when the daemon resumes on the new machine the problems start: no two hardware clocks drift the same way or share the exact same wallclock time, so all the state the daemon has accumulated is suddenly inaccurate. The result ranges from an error that is slowly recovered from to one that ntpd may never recover from, leaving the virtualized OS&#039;s clock unstable.&lt;br /&gt;
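&lt;br /&gt;
The C sketch below (our own simplification, not Xen or ntpd code) shows the core of the problem: a feedback-style clock carries per-machine state, so after migration its learned rate correction is applied to an oscillator it was never trained on, and the error grows instead of shrinking.&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Simplified feedback-clock state, in the spirit of ntpd. */
typedef struct {
    double offset_s;    /* last estimated error vs. the reference */
    double drift_ppm;   /* learned rate error of THIS machine     */
} feedback_state;

int main(void) {
    feedback_state st = { 0.0, 40.0 };  /* trained on host A: +40 ppm */
    double host_b_drift_ppm = -25.0;    /* host B oscillator differs  */

    /* After migration the daemon still compensates for +40 ppm on a
       clock that actually drifts by -25 ppm. */
    double residual = host_b_drift_ppm - st.drift_ppm;
    printf(&amp;quot;residual rate error after migration: %.0f ppm\n&amp;quot;, residual);
    printf(&amp;quot;clock error accumulated in one hour: %.1f ms\n&amp;quot;,
           residual * 1e-6 * 3600.0 * 1000.0);
    return 0;
}
&lt;/pre&gt;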
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
(sections are stubs for the moment ... more to come)&lt;br /&gt;
The contributions of this paper were:&lt;br /&gt;
&lt;br /&gt;
* baseline evaluations of:&lt;br /&gt;
** performance of NTP in dependent and independent configurations&lt;br /&gt;
** Xen Clocksource as a basis counter under NTP&lt;br /&gt;
** latencies of different clock sources&lt;br /&gt;
** implications of Power Management&lt;br /&gt;
&lt;br /&gt;
* new architecture&lt;br /&gt;
** RADclock&lt;br /&gt;
** feed-forward versus feedback&lt;br /&gt;
&lt;br /&gt;
* evaluation of RADclock vs ntpd&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. &amp;quot;Virtualize Everything But Time&amp;quot; by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darryl Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf&lt;br /&gt;
&lt;br /&gt;
2. &amp;quot;Timekeeping in Virtual Machines, Information Guide&amp;quot; from VMware. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
3. &amp;quot;Bran&#039;s Kernel Development Tutorial&amp;quot; from the Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm&lt;br /&gt;
&lt;br /&gt;
4. &amp;quot;What is a CMOS battery, and why does my computer need one?&amp;quot; from Indiana University&#039;s Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html&lt;br /&gt;
&lt;br /&gt;
5. &amp;quot;Multiprocessor Specification version 1.4&amp;quot; from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf&lt;br /&gt;
&lt;br /&gt;
6. &amp;quot;PC Based Precision Timing Without GPS&amp;quot; by Attila Pásztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf&lt;br /&gt;
&lt;br /&gt;
7. &amp;quot;Robust Synchronization of Absolute and Difference Clocks over Networks&amp;quot; by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6139</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6139"/>
		<updated>2010-12-02T03:34:05Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Virtualize Everything But Time =&lt;br /&gt;
Article written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darryl Veitch of the Centre for Ultra-Broadband Information Networks (CUBIN), Department of Electrical &amp;amp; Electronic Engineering, University of Melbourne, Australia. Here is the link to the article: [http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provide you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.&lt;br /&gt;
&lt;br /&gt;
The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client last had the CPU&#039;s attention? Even at the best of times there&#039;s no way for it to know, because it wasn&#039;t actually running. If the VM was suspended and migrated from one physical host to another, its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is important to have a good grasp of the general concepts broached in it. For example, we all know what clocks are in our day-to-day life, but what are they in the context of computing? In this section we describe concepts like timekeeping, hardware and software clocks, the advantages and disadvantages of the different available counters, and synchronization algorithms, and we explain what a para-virtualized system is.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
For thousands of years, people have tried to find better ways to keep track of time. From sundials to atomic clocks, all were made for the specific purpose of measuring its passage. Computer operating systems are no different. Timekeeping is typically done in one of two ways: tick counting and tickless timekeeping[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]. In tick counting, the operating system programs a hardware device to interrupt the CPU at a known rate; each time one of those interrupts fires (a tick), the operating system increments a counter, which tells the system how much time has passed. In tickless timekeeping, the OS does not track time through interrupts; instead, a hardware device starts its own counter when the system boots, and the OS simply reads that counter when needed. Tickless timekeeping seems the better approach because it does not hog the CPU with timer interrupts; however, its performance depends heavily on the hardware used, and such counters tend to drift, causing inaccuracy (drift is explained later). Both approaches, though, are just counters: they do not know the actual real-world time. To remedy that, a computer either gets its time from a battery-backed real-time clock or queries a network time server using NTP. It can also run a daemon that periodically adjusts the time.&lt;br /&gt;
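&lt;br /&gt;
As a rough sketch of the two styles (our own C illustration; the interrupt rate and counter frequency are made up), tick counting accumulates interrupts while tickless timekeeping derives the time on demand from a free-running counter:&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

/* Tick counting: an interrupt handler increments a counter. */
#define HZ 1000                      /* assumed interrupt rate  */
static uint64_t jiffies = 0;         /* one per timer interrupt */
static void timer_interrupt(void) { jiffies++; }

/* Tickless: a free-running counter is read only on demand. */
#define COUNTER_HZ 10000000ULL       /* assumed 10 MHz counter  */
static uint64_t hw_counter = 0;      /* stands in for real HW   */

int main(void) {
    /* Simulate 50 ms passing. */
    for (int i = 0; i &lt; 50; i++) timer_interrupt(); /* 50 ticks */
    hw_counter += COUNTER_HZ / 20;                  /* 50 ms    */

    printf(&amp;quot;tick-counted time: %llu ms\n&amp;quot;,
           (unsigned long long)(jiffies * 1000 / HZ));
    printf(&amp;quot;tickless time:     %llu ms\n&amp;quot;,
           (unsigned long long)(hw_counter * 1000 / COUNTER_HZ));
    return 0;
}
&lt;/pre&gt;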
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based, or even a hybrid of the two. The most commonly found timer is the hardware timer. All hardware timers can generally be described by the diagram below, though individual timers may have more or fewer features:&lt;br /&gt;
&lt;br /&gt;
Diagram 1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency, which the operating system may have to measure when the system boots. The counter starts at a predetermined value that can be set by software. On every cycle of the oscillator, the counter counts down one unit; when it reaches zero, it generates an output signal that may interrupt the CPU. That same interrupt then causes the counter&#039;s initial value to be reloaded, and the process begins again. Not all hardware timers work exactly like this: some count up, some don&#039;t use interrupts, and others don&#039;t keep an initial value. The general principle is the same, however. There is some fixed interval at the end of which the current time is advanced by an appropriate number of units (e.g. nanoseconds).&lt;br /&gt;
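&lt;br /&gt;
A minimal C simulation of this countdown-and-reload abstraction (our own sketch; a real timer does this in hardware, and the reload value 1193 is only a nod to the PIT&#039;s roughly 1.19 MHz input clock):&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define RELOAD_VALUE 1193            /* initial value set by software */

int main(void) {
    uint32_t counter = RELOAD_VALUE;
    uint64_t interrupts = 0;

    /* Each loop iteration stands in for one oscillator cycle. */
    for (uint64_t cycle = 0; cycle &lt; 5000; cycle++) {
        if (--counter == 0) {
            interrupts++;            /* output signal: CPU interrupt */
            counter = RELOAD_VALUE;  /* reload and start over        */
        }
    }
    printf(&amp;quot;cycles: 5000, reload: %d, interrupts fired: %llu\n&amp;quot;,
           RELOAD_VALUE, (unsigned long long)interrupts);
    return 0;
}
&lt;/pre&gt;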
&lt;br /&gt;
===Timers===&lt;br /&gt;
# The PIT (Programmable Interval Timer) generates interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0, which interrupts the CPU at regular intervals; Channel 1 is system-specific and Channel 2 drives the PC speaker, so only Channel 0 concerns us here. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# The CMOS RTC (real-time clock) is kept powered by the CMOS battery so that it can keep track of things like the time even while the PC has no other source of power. Without a CMOS battery on the motherboard, the computer would reset to its default time on each restart. The battery itself can die if the computer is powered off and unused for a long period, which can cause issues for the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# The Local APIC handles all external interrupts for its processor and can also accept and generate inter-processor interrupts between Local APICs. It also contains a programmable timer that can interrupt its own processor. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI establishes industry-standard interfaces for OS-directed device configuration and power management on notebooks, desktops, and servers. Its goal is to improve on earlier power and configuration standards by transitioning to ACPI-compliant hardware, which gives the OS – and therefore the VM – control over power management. ACPI also defines a power-management timer that the OS can read as a clock source. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]&lt;br /&gt;
# The TSC (Time Stamp Counter) is a per-processor register that counts CPU clock cycles since reset and is read with the rdtsc instruction. It is the highest-resolution and cheapest counter to read, but frequency scaling and unsynchronized counters across cores can make it tricky to use as a clock; see the sketch after this list.&lt;br /&gt;
# The HPET (High Precision Event Timer) is a memory-mapped timer with a main counter running at 10 MHz or more and a set of comparators that can generate interrupts; it was designed to replace the PIT and the RTC&#039;s periodic interrupt.&lt;br /&gt;
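&lt;br /&gt;
As an example of the cheapest of these sources to read, here is a small C sketch (x86-specific, using the __rdtsc intrinsic available in GCC and Clang) that reads the TSC around some work. Note that cycles only become seconds once the OS has calibrated the TSC frequency, for example against the PIT or HPET at boot:&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;
#include &lt;x86intrin.h&gt;   /* __rdtsc(), x86/x86-64 with GCC or Clang */

int main(void) {
    uint64_t start = __rdtsc();

    volatile uint64_t sink = 0;              /* some work to time */
    for (int i = 0; i &lt; 1000000; i++)
        sink += (uint64_t)i;

    uint64_t cycles = __rdtsc() - start;
    printf(&amp;quot;elapsed: %llu TSC cycles\n&amp;quot;,
           (unsigned long long)cycles);
    return 0;
}
&lt;/pre&gt;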
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping uses the same general methods as any computer timekeeping: either tick counting or a tickless system. Where the two differ is that a host operating system can communicate directly with the physical hardware, while a guest operating system cannot – it must ask the host to reach the hardware on its behalf. This indirection is the greatest source of the guest operating system&#039;s clock losing accuracy, more simply called drift.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system starts, its clock simply synchronizes with the host&#039;s – some virtual machine monitors, such as VMware&#039;s, also do this when a guest is resumed from a suspended state or restored from a snapshot – so it is tempting to think that, since the clock starts off correct, it will stay correct. It will not. The first source of drift is the drift the host incurs in its own timekeeping. A clock is almost never entirely accurate, carrying a slight error from the time it takes to communicate with the counter, even on the host. Because the guest relies on the host to keep its time, an error in the host&#039;s time is passed on to the guest; worse, while the host is busy correcting its own time, the guest&#039;s request for a count gets slightly lower priority, costing still more accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift compounds the guest&#039;s own.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in a virtual environment is that the host treats the virtual machine like a process. In and of itself this is not a problem, but as a consequence the virtual machine can be denied the CPU time it requires, or allocated less memory than it needs. With restricted CPU time, requested ticks or counter reads can pile up into a backlog, or the requested data can arrive late enough to throw the guest&#039;s clock off. With restricted memory, the virtual environment can end up swapping out pages that are needed again soon; swapping them back in momentarily brings the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
At its root, drift boils down to round-off errors and lost ticks, but its practical impact is apparent in any automated system. For a relatable real-world example, consider a factory&#039;s assembly line: the machinery is finely tuned to perform its specific part at certain intervals, and it generally does so with impressive efficiency. If the system&#039;s clock were to drift, a machine might move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment the effects are subtler. One result can be skewed process scheduling: some schedulers give each process a fixed slice of time before moving on, so if the guest&#039;s time has drifted substantially, the correction can hand processes more or less time than they were due.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for drift, depending on its cause. If the problem stems from CPU management, the host can give more CPU time to the virtual machine, lower the timer interrupt rate, or simply use a tickless counter. If it stems from memory management, allocating more memory to the virtual environment should keep it from having to swap pages out so often.&lt;br /&gt;
&lt;br /&gt;
If the issue is from neither of those, but simply from the inevitable lag when the guest reaches the hardware through the host, there are other ways to correct the drift. Most systems have native algorithms to correct the time when it gets too far ahead of or behind real time, though these have faults of their own: if the clock is stepped forward while catching up, the backlog of ticks may not be cleared, so the system can keep stepping itself ahead until the backlog is dealt with. Tools built into the virtual machine, such as VMware Tools, can also deal with drift to an extent. This kind of tool checks whether the clock&#039;s error is within a certain margin; if the margin is exceeded, the backlog is set to zero – preventing the issue just mentioned with the native algorithms – and the clock is resynchronized with the host before the guest resumes keeping time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research Problem=&lt;br /&gt;
&lt;br /&gt;
Today, the Network Time Protocol (NTP) and daemons like ntpd are the dominant solution for accurate timekeeping. Under optimal conditions ntpd can perform very well, but such conditions are rare: network congestion, disconnections, lower-quality networking hardware and unexpected system events can create offset errors on the order of tens or even a hundred milliseconds (ms). [http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf]&lt;br /&gt;
For demanding applications, this is neither robust nor reliable. One way to improve ntpd&#039;s performance would be to poll the NTP server more often, which would reduce the offset error; unfortunately, it would also increase network traffic, which could cause congestion and raise the offset error again. Simply polling faster therefore does not work.&lt;br /&gt;
&lt;br /&gt;
Another problem with current system software clocks disciplined by NTP (like ntpd) is that they provide only an absolute clock.[http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf]&lt;br /&gt;
This makes them unsuitable for applications that deal with network management and measurement. Why? Because NTP focuses on offset, not on the rate of the hardware clock&#039;s oscillator. When calculating delay variations, for example, a constant offset error cancels out of the calculation entirely, while variation in the oscillator rate does affect it. A clock with a stable, accurate rate would make such calculations more precise, which means we would need a different kind of system software clock: a difference clock.&lt;br /&gt;
&lt;br /&gt;
In virtualization (here, Xen), migrating a running guest from one physical machine to another can also cause problems, and again ntpd is the culprit. By default, each guest OS runs its own instance of the ntpd daemon, whose synchronization algorithm maintains state – a reference wallclock time, a rate of drift and a current clock error – that is defined by the hardware clock of the machine it runs on. When the virtualized OS is migrated, this ntpd state travels with it, and when the daemon resumes on the new machine the problems start: no two hardware clocks drift the same way or share the exact same wallclock time, so all the state the daemon has accumulated is suddenly inaccurate. The result ranges from an error that is slowly recovered from to one that ntpd may never recover from, leaving the virtualized OS&#039;s clock unstable.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
(sections are stubs for the moment ... more to come)&lt;br /&gt;
The contributions of this paper were:&lt;br /&gt;
&lt;br /&gt;
* baseline evaluations of:&lt;br /&gt;
** performance of NTP in dependent and independent configurations&lt;br /&gt;
** Xen Clocksource as a basis counter under NTP&lt;br /&gt;
** latencies of different clock sources&lt;br /&gt;
** implications of Power Management&lt;br /&gt;
&lt;br /&gt;
* new architecture&lt;br /&gt;
** RADclock&lt;br /&gt;
** feed-forward versus feedback&lt;br /&gt;
&lt;br /&gt;
* evaluation of RADclock vs ntpd&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. &amp;quot;Virtualize Everything But Time&amp;quot; by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darryl Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf&lt;br /&gt;
&lt;br /&gt;
2. &amp;quot;Timekeeping in Virtual Machines, Information Guide&amp;quot; from VMware. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
3. &amp;quot;Bran&#039;s Kernel Development Tutorial&amp;quot; from the Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm&lt;br /&gt;
&lt;br /&gt;
4. &amp;quot;What is a CMOS battery, and why does my computer need one?&amp;quot; from Indiana University&#039;s Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html&lt;br /&gt;
&lt;br /&gt;
5. &amp;quot;Multiprocessor Specification version 1.4&amp;quot; from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf&lt;br /&gt;
&lt;br /&gt;
6. &amp;quot;PC Based Precision Timing Without GPS&amp;quot; by Attila Pásztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf&lt;br /&gt;
&lt;br /&gt;
7. &amp;quot;Robust Synchronization of Absolute and Difference Clocks over Networks&amp;quot; by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6129</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=6129"/>
		<updated>2010-12-02T03:25:20Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Virtualize Everything But Time =&lt;br /&gt;
Article written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch. They are working for the Center for Ultra-Broadband Information Networks (CUBIN) Department of Electrical &amp;amp; Electronic Engineering at the University of Melbourne in Australia. Here is the link to the article: [http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provides you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.&lt;br /&gt;
&lt;br /&gt;
The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client got the CPU&#039;s attention? At the best of times there&#039;s no way for it to guess because it wasn&#039;t actually running. If the VM was suspended and migrated from one physical host to another its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
Since thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency. The operating system might have to measure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. Not all hardware timers work exactly like that. For instance, some actually count up, others don&#039;t use interrupts, and yet others don&#039;t keep an initial counter. The general principle of hardware counters is the however the same. There is some kind of fixed interval at the end of which the current time is updated by an appropriate number of units (i.e. nanoseconds).&lt;br /&gt;
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI establishes industry-standard interfaces configuration guided by the OS and power management. Power Management includes notebooks, desktops, and servers. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/]&lt;br /&gt;
# TSC&lt;br /&gt;
# HPET (I will fill these in later, Dec. 1st Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system&#039;s clock losing accuracy, or more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host&#039;s time is not only passed on to the guest, but because the host is trying to correct its own time the guest&#039;s request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it&#039;s easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory&#039;s assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on the cause of it. If the problem is due to CPU management issues, then the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a  memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out page files so often.&lt;br /&gt;
&lt;br /&gt;
If the issue is from neither of those, but simply due to the inevitable lag when the guest communicates with the hardware via the host, then there are other methods to correct the drift. Most systems natively have algorithms built in to correct the time if it gets too far ahead or behind real time, though they are not without their own faults; if the time is set ahead when catching up, the backlog of ticks it has built up may not be cleared, so it could potentially set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks to see if the clock&#039;s error is within a certain margin. If it exceeds the margin, then the backlog is set to zero – to prevent the issue mentioned with the native algorithms – and resynchronizes with the host clock before the guest goes back to keeping track of time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Today, the use of the Network Time Protocol and of daemons like ntpd is the dominant solution for accurate timekeeping. In optimal conditions, the ntpd can be very good but these situations rarely happen. Network congestion, disconnections, lower quality networking hardware and unsuspected system events can create offsets errors in the order of 10‘s or even 100 milliseconds(ms). [http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf]&lt;br /&gt;
For demanding applications, this is neither robust or reliable. One way to enhance the performance of ntpd would be to poll from the NTP server more often as this would reduced the offset error but unfortunately, this would increase the network traffic which could cause network congestions which would raise the offset error. So this won’t work. &lt;br /&gt;
&lt;br /&gt;
Another problem with current system software clocks using NTP(like ntpd), is that they provide only an absolute clock.[http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf]&lt;br /&gt;
So for applications that deals with network managements and measurements, this is unsuitable. Why? Because NTP focus on offset and not on hardware clock oscillator rate. For example, when calculating delay variations, the offset error doesn’t change anything to the calculations but the clocks’ oscillator rate variation does affect it. So having a more accurate timestamp would make those calculation more precise. Which mean we would need another system software clock.&lt;br /&gt;
&lt;br /&gt;
In virtualization(in this case Xen), when migrating a running system from one system to another can cause issues and this is again caused by the ntpd daemon. By default, each guest OS runs its own instance of the ntpd daemon. So the synchronization algorithm keeps track of the reference wallclock time, rate-of-drift and current clock error, which are defined by the hardware clock on the system. So when migrating the virtualized OS to another system, the ntpd state is saved and when it is enabled again on the new system, thats where the problems starts. Because no two hardware clocks drifts the same way or have the exact same wallclock time, all the information traced by the daemon are all of a sudden inaccurate. This could prove disastrous to the system. This could go from a slowly recoverable error to one where ntpd might never recover, making the virtualized OS unstable.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
(sections are stubs for the moment ... more to come)&lt;br /&gt;
The contributions of this paper were:&lt;br /&gt;
&lt;br /&gt;
* baseline evaluations of:&lt;br /&gt;
** performance of NTP in dependent and independent configurations&lt;br /&gt;
** Xen Clocksource as a basis counter under NTP&lt;br /&gt;
** latencies of different clock sources&lt;br /&gt;
** implications of Power Management&lt;br /&gt;
&lt;br /&gt;
* new architecture&lt;br /&gt;
** RADclock&lt;br /&gt;
** feed-forward versus feedback&lt;br /&gt;
&lt;br /&gt;
* evaluation of RADclock vs ntpd&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. &amp;quot;Virtualize Everything But Time&amp;quot; by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf&lt;br /&gt;
&lt;br /&gt;
2. &amp;quot;Timekeeping in Virtual Machines, Information Guide&amp;quot; from VMWare. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
3. &amp;quot;Bran&#039;s Kernel Development Tutorial&amp;quot; from Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm  &lt;br /&gt;
&lt;br /&gt;
4. &amp;quot;What is a CMOS battery, and why does my computer need one?&amp;quot; from the Indiana University&#039;s Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html&lt;br /&gt;
&lt;br /&gt;
5. &amp;quot;Multiprocessor Specification version 1.4&amp;quot; from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf&lt;br /&gt;
&lt;br /&gt;
6. &amp;quot;PC Based Precision Timing Without GPS&amp;quot; by Attila Pa ́sztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf&lt;br /&gt;
&lt;br /&gt;
7. &amp;quot;Robust Synchronization of Absolute and Difference Clocks over Networks&amp;quot; by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5693</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5693"/>
		<updated>2010-11-29T17:41:20Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
Since thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[1]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represent how tick counting works[2]. The oscillator runs at a predetermine frequency. The operating system might have to mesure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. As I said earlier, not all hardware timer works exactly like that. Some actually counts up, doesn&#039;t use interrupts or doesn&#039;t have an initial counter value but they still follow the same principle.&lt;br /&gt;
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]&lt;br /&gt;
# ACPI (I will fill these in later, Nov. 29th Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
# TSC&lt;br /&gt;
# HPET&lt;br /&gt;
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system&#039;s clock losing accuracy, or more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host&#039;s time is not only passed on to the guest, but because the host is trying to correct its own time the guest&#039;s request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it&#039;s easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory&#039;s assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on the cause of it. If the problem is due to CPU management issues, then the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a  memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out page files so often.&lt;br /&gt;
&lt;br /&gt;
If the issue is from neither of those, but simply due to the inevitable lag when the guest communicates with the hardware via the host, then there are other methods to correct the drift. Most systems natively have algorithms built in to correct the time if it gets too far ahead or behind real time, though they are not without their own faults; if the time is set ahead when catching up, the backlog of ticks it has built up may not be cleared, so it could potentially set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks to see if the clock&#039;s error is within a certain margin. If it exceeds the margin, then the backlog is set to zero – to prevent the issue mentioned with the native algorithms – and resynchronizes with the host clock before the guest goes back to keeping track of time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
=Contribution=&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. Timekeeping in Virtual Machines, Information Guide from VMWare. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
2. Modern Operating System 3rd Edition, by Andrew S. Tanenbaum, published by Pearson.&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5691</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5691"/>
		<updated>2010-11-29T16:57:31Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
Since thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[1]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.&lt;br /&gt;
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:&lt;br /&gt;
&lt;br /&gt;
Diagram1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represent how tick counting works[2]. The oscillator runs at a predetermine frequency. The operating system might have to mesure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. As I said earlier, not all hardware timer works exactly like that. Some actually counts up, doesn&#039;t use interrupts or doesn&#039;t have an initial counter value but they still follow the same principle.&lt;br /&gt;
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]&lt;br /&gt;
# Local APIC (I will fill these in later, Nov. 29th Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
# ACPI&lt;br /&gt;
# TSC&lt;br /&gt;
# HPET&lt;br /&gt;
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system&#039;s clock losing accuracy, or more simply called drifting.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock simply synchronizes with the host&#039;s – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest&#039;s clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host&#039;s time is not only passed on to the guest, but because the host is trying to correct its own time the guest&#039;s request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host&#039;s drift simply compounds the issue.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn&#039;t seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it&#039;s easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory&#039;s assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest&#039;s time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on its cause. If the problem is due to CPU management, the host can give more CPU time to the virtual machine, lower the timer interrupt rate, or simply use a tickless counter. If it is due to memory management, allocating more memory to the virtual environment should keep the system from having to swap pages out so often.&lt;br /&gt;
&lt;br /&gt;
If the issue stems from neither of those, but simply from the unavoidable lag when the guest reaches the hardware via the host, there are other ways to correct the drift. Most operating systems have built-in algorithms to correct the time when it gets too far ahead of or behind real time, though these have their own faults: if the clock is stepped forward while catching up, the backlog of ticks that has built up may not be cleared, so the system can step itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine software, such as VMware Tools, can also deal with drift to an extent. A tool of this kind checks whether the clock&#039;s error is within a certain margin; if the error exceeds the margin, it sets the backlog to zero – preventing the problem just described with the native algorithms – and resynchronizes the guest with the host clock before the guest resumes keeping time as it normally would.&lt;br /&gt;
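&lt;br /&gt;
A minimal sketch of that check-and-resync policy (our own illustration of the idea, not VMware&#039;s actual code; all the helper functions and values here are hypothetical stand-ins):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical stand-ins for what a VM tool would query and control. */
static double host = 100.0, guest = 103.5;   /* guest is 3.5 s ahead */
static double host_time(void)  { return host; }
static double guest_time(void) { return guest; }
static void step_clock(double t)     { guest = t; }
static void clear_tick_backlog(void) { puts(&quot;backlog cleared&quot;); }
static void slew_clock(double err)   { guest -= err * 0.1; }

#define MARGIN 1.0  /* seconds of error tolerated before a hard resync */

int main(void) {
    double err = guest_time() - host_time();
    if (fabs(err) &gt; MARGIN) {
        clear_tick_backlog();     /* so we do not step ahead repeatedly */
        step_clock(host_time());  /* hard resynchronize with the host   */
    } else {
        slew_clock(err);          /* small error: correct it gradually  */
    }
    printf(&quot;guest clock now %.1f (host %.1f)\n&quot;, guest_time(), host_time());
    return 0;
}
&lt;/pre&gt;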
&lt;br /&gt;
=Research problem=&lt;br /&gt;
=Contribution=&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. Timekeeping in Virtual Machines, Information Guide from VMware. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
2. Modern Operating Systems, 3rd Edition, by Andrew S. Tanenbaum, published by Pearson.&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5690</id>
		<title>COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_11&amp;diff=5690"/>
		<updated>2010-11-29T16:47:56Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Timers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
To better understand this paper, it is important to have a good grasp of the general concepts it touches on. For example, we all know what clocks are in day-to-day life, but what are they in the context of computing? In this section we describe concepts such as timekeeping, hardware and software clocks, the advantages and disadvantages of the different available counters, and synchronization algorithms, and we explain what a para-virtualized system is.&lt;br /&gt;
&lt;br /&gt;
===Timekeeping===&lt;br /&gt;
&lt;br /&gt;
For thousands of years, people have tried to find better ways to keep track of time. From sundials to atomic clocks, all were made for the specific purpose of measuring the passage of time. Things are not so different in computer operating systems, where timekeeping is typically done in one of two ways: tick counting and tickless timekeeping[1]. With tick counting, the operating system sets up a hardware device to interrupt the CPU at a fixed rate. Each time one of those interrupts fires (a tick), the operating system records it in a counter, which tells the system how much time has passed. With tickless timekeeping, instead of the OS tracking time through interrupts, a hardware device starts its own counter when the system boots, and the OS simply reads that counter when needed. Tickless timekeeping may seem the better approach because it does not hog the CPU with hardware interrupts; however, its performance depends heavily on the hardware used, and such counters tend to drift, causing inaccuracy. Those drifts are explained later. Both mechanisms, though, are just counters: they do not know the actual real-world time. To remedy that, a computer either gets its time from a battery-backed real-time clock or queries a network time server using NTP. It can also use software in the form of a daemon that runs periodically to adjust the time.&lt;br /&gt;
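&lt;br /&gt;
A rough C sketch (ours, not the paper&#039;s) contrasting the two styles: a tick-counted clock is an interrupt handler bumping a counter, while a tickless clock just reads a free-running hardware counter on demand (faked here; on real hardware it would be the TSC, HPET, or similar).&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* --- Tick counting: a periodic interrupt bumps a counter. --- */
#define HZ 1000                     /* interrupts per second */
static volatile uint64_t jiffies;   /* ticks since boot      */

static void timer_interrupt_handler(void) {  /* fires HZ times a second */
    jiffies++;
}

static double ticked_uptime(void) {
    return (double)jiffies / HZ;
}

/* --- Tickless: read a free-running hardware counter on demand. --- */
static uint64_t counter_at_boot;
static const uint64_t counter_hz = 1000000u;  /* made-up 1 MHz counter */

static uint64_t read_hw_counter(void) {
    static uint64_t fake;           /* stand-in for TSC/HPET hardware */
    return fake += 1000000u;        /* pretend 1 s passes per read    */
}

static double tickless_uptime(void) {
    return (double)(read_hw_counter() - counter_at_boot) / counter_hz;
}

int main(void) {
    counter_at_boot = read_hw_counter();
    for (int i = 0; i &lt; 5000; i++)
        timer_interrupt_handler();  /* simulate 5 s worth of ticks */
    printf(&quot;tick-counted uptime: %.3f s\n&quot;, ticked_uptime());
    printf(&quot;tickless uptime:     %.3f s\n&quot;, tickless_uptime());
    return 0;
}
&lt;/pre&gt;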
&lt;br /&gt;
===Clocks===&lt;br /&gt;
&lt;br /&gt;
Computer “clocks” or “timers” can be hardware based, software based, or even a hybrid of the two. The most common is the hardware timer. Hardware timers can generally be described by the diagram below, though individual timers may have more or fewer features:&lt;br /&gt;
&lt;br /&gt;
Diagram 1. Timer Abstraction&lt;br /&gt;
&lt;br /&gt;
[[File:Timerabstract.jpg]]&lt;br /&gt;
&lt;br /&gt;
This diagram nicely represents how tick counting works[2]. The oscillator runs at a predetermined frequency, which the operating system may have to measure when the system boots. The counter starts at a preset value that can be set by software. On every cycle of the oscillator the counter counts down one unit; when it reaches zero, it generates an output signal that may interrupt the CPU. That interrupt also causes the counter&#039;s initial value to be reloaded, and the process begins again. As noted earlier, not every hardware timer works exactly like this – some count up, do not use interrupts, or have no initial counter value – but they all follow the same principle.&lt;br /&gt;
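&lt;br /&gt;
To make the diagram concrete, here is a toy software model of it (ours, not from reference [2]): a down-counter that reloads its initial value and raises an &quot;interrupt&quot; each time it hits zero.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

/* Toy model of the timer in Diagram 1. */
struct hw_timer {
    unsigned reload;       /* initial value, set by software */
    unsigned count;        /* current counter value          */
    unsigned long ticks;   /* interrupts generated so far    */
};

/* One oscillator cycle: count down; fire and reload at zero. */
static bool oscillator_cycle(struct hw_timer *t) {
    if (--t-&gt;count == 0) {
        t-&gt;count = t-&gt;reload;   /* reload the initial value        */
        t-&gt;ticks++;             /* the output signal: an interrupt */
        return true;
    }
    return false;
}

int main(void) {
    struct hw_timer t = { .reload = 4, .count = 4, .ticks = 0 };
    for (int cycle = 1; cycle &lt;= 12; cycle++)
        if (oscillator_cycle(&amp;t))
            printf(&quot;cycle %2d: interrupt #%lu\n&quot;, cycle, t.ticks);
    return 0;
}
&lt;/pre&gt;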
&lt;br /&gt;
===Timers===&lt;br /&gt;
# PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0, which interrupts the CPU at regular intervals; Channel 1 is specific to each system; and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0 (see the programming sketch after this list). [http://www.osdever.net/bkerndev/Docs/pit.htm]&lt;br /&gt;
# CMOS RTC, also known as the CMOS battery clock, lets the system keep time even while the PC is unplugged or a laptop is out of power. Without a working CMOS battery on the motherboard, the computer would reset to its default time on each restart. The battery can die, as expected, if the computer is powered off and unused for a long period of time, which can cause issues for the main OS as well as for a VM.&lt;br /&gt;
# Local APIC (I will fill these in later, Nov. 29th Update) --[[User:Spanke|Spanke]]&lt;br /&gt;
# ACPI&lt;br /&gt;
# TSC&lt;br /&gt;
# HPET&lt;br /&gt;
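&lt;br /&gt;
As a sketch of the Channel 0 programming described in the PIT item above (following the approach in the linked bkerndev tutorial, but with outb() stubbed out so the snippet runs anywhere; in a real kernel it would execute the x86 port-write instruction):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Stub: a real kernel would issue the x86 outb instruction here. */
static void outb(uint16_t port, uint8_t val) {
    printf(&quot;outb(0x%02x, 0x%02x)\n&quot;, port, val);
}

#define PIT_BASE_HZ 1193182u  /* the PIT&#039;s fixed input frequency */

/* Program PIT channel 0 to interrupt the CPU hz times a second. */
static void pit_set_frequency(unsigned hz) {
    uint16_t divisor = (uint16_t)(PIT_BASE_HZ / hz);
    outb(0x43, 0x36);                    /* channel 0, lo/hi byte, mode 3 */
    outb(0x40, divisor &amp; 0xFF);          /* low byte of divisor           */
    outb(0x40, (divisor &gt;&gt; 8) &amp; 0xFF);   /* high byte of divisor          */
}

int main(void) {
    pit_set_frequency(100);  /* e.g. a 100 Hz tick */
    return 0;
}
&lt;/pre&gt;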
&lt;br /&gt;
==Guest Timekeeping==&lt;br /&gt;
&lt;br /&gt;
Guest timekeeping uses the same general methods as timekeeping on any computer: either tick counting or a tickless system. The difference is that a host operating system can communicate directly with the physical timing hardware, while a guest operating system cannot; every access to the hardware must be relayed through the host. This extra indirection is the greatest source of inaccuracy in the guest operating system&#039;s clock, more commonly known as drift.&lt;br /&gt;
&lt;br /&gt;
===Sources of Drift===&lt;br /&gt;
&lt;br /&gt;
When a guest operating system is started, its clock is simply synchronized with the host&#039;s – virtual machine monitors such as VMware also do this when a guest is resumed from a suspended state or restored from a snapshot – so it is tempting to assume that, since the guest starts off correct, its clock will stay correct. It will not. The first source of drift is the drift the host incurs in its own timekeeping. A clock is almost never perfectly accurate, even on the host, if only because of the time spent communicating with the counter. Because the guest relies on the host to keep track of its time, any error in the host&#039;s time is passed on to the guest; worse, while the host is busy correcting its own time, the guest&#039;s requests for a count are given slightly lower priority, costing yet more accuracy. The larger the drift in the host, the larger the drift in the guest, since the host&#039;s drift compounds the problem.&lt;br /&gt;
&lt;br /&gt;
Aside from the host&#039;s own drift, the other cause of drift in a virtual environment is that the guest is treated as a process by the host. In itself this does not seem like a problem, but as a consequence the guest can be denied the CPU time it requires, or allocated less memory than it needs. With restricted CPU time, requested ticks or counter reads can easily pile up into a backlog of requests, or the requested data can arrive late enough to throw the guest&#039;s clock off. With memory, if the host allocates too little to the virtual environment, pages that will be needed soon may be swapped out. Swapping those pages back in momentarily brings the entire virtual environment to a halt, so ticks are missed and the clock falls behind.&lt;br /&gt;
&lt;br /&gt;
===Impact of Drift===&lt;br /&gt;
&lt;br /&gt;
The impact of drift essentially boils down to round-off errors and lost ticks, but its practical impact shows up in any automated system. For a relatable real-world example, though not one in a virtual environment, consider a factory assembly line: the machinery is finely tuned so that each machine does its specific part at certain intervals, and it generally does so with impressive efficiency. If the system&#039;s clock drifts, a machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment the effects are subtler; one result can be skewed process scheduling. Some schedulers give each process a fixed amount of time before moving on, and if the guest&#039;s time has drifted substantially, then when it corrects its time it may give processes more or less time than the scheduler intended.&lt;br /&gt;
&lt;br /&gt;
===Compensation Strategies===&lt;br /&gt;
&lt;br /&gt;
There are a number of compensation strategies for dealing with drift, depending on its cause. If the problem is due to CPU management, the host can give more CPU time to the virtual machine, lower the timer interrupt rate, or simply use a tickless counter. If it is due to memory management, allocating more memory to the virtual environment should keep the system from having to swap pages out so often.&lt;br /&gt;
&lt;br /&gt;
If the issue stems from neither of those, but simply from the unavoidable lag when the guest reaches the hardware via the host, there are other ways to correct the drift. Most operating systems have built-in algorithms to correct the time when it gets too far ahead of or behind real time, though these have their own faults: if the clock is stepped forward while catching up, the backlog of ticks that has built up may not be cleared, so the system can step itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine software, such as VMware Tools, can also deal with drift to an extent. A tool of this kind checks whether the clock&#039;s error is within a certain margin; if the error exceeds the margin, it sets the backlog to zero – preventing the problem just described with the native algorithms – and resynchronizes the guest with the host clock before the guest resumes keeping time as it normally would.&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
=Contribution=&lt;br /&gt;
=Critique=&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
1. Timekeeping in Virtual Machines, Information Guide from VMware. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf&lt;br /&gt;
&lt;br /&gt;
2. Modern Operating Systems, 3rd Edition, by Andrew S. Tanenbaum, published by Pearson.&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_11&amp;diff=4965</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_11&amp;diff=4965"/>
		<updated>2010-11-15T03:05:34Z</updated>

		<summary type="html">&lt;p&gt;Spanke: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Please mark an X if you are able to participate.&lt;br /&gt;
&lt;br /&gt;
(X) Blais   Sylvain sblais2&amp;lt;br&amp;gt;&lt;br /&gt;
(X) Graham  Scott   sgraham6&amp;lt;br&amp;gt;&lt;br /&gt;
( ) Ilitchev Fedor  filitche&amp;lt;br&amp;gt;&lt;br /&gt;
(X) Panke   Shane   spanke&amp;lt;br&amp;gt;&lt;br /&gt;
(X) Shukla  Abhinav ashukla2&amp;lt;br&amp;gt;&lt;br /&gt;
(X) Wilson  Robert  jjpwilso&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4288</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4288"/>
		<updated>2010-10-15T00:39:21Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== The Background ==&lt;br /&gt;
&lt;br /&gt;
A &#039;process&#039; is defined as &amp;quot;an address-space and a group of resources dedicated to running the program&amp;quot;. A &#039;thread&#039;, on the other hand, is an independent sequential unit of computation that executes within the context of a kernel-supported entity like a &#039;process&#039;. Threads are often classified by their “weight” (or overhead), which corresponds to the amount of context that must be saved when a thread is removed from the processor and restored when it is reinstated – that is, a context switch. The context for a process usually includes the hardware registers, kernel stack, user-level stack, interrupt vectors, page tables, and more. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start; this is why millions of them can exist in a single process. Loosely based on this, there are two major types of threads: kernel and user mode. Kernel threads are usually considered heavier, and designs that rely on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads and are lightweight. The ratio of user threads to kernel threads is an important factor when designing scalable systems.&lt;br /&gt;
&lt;br /&gt;
There are a few designs, mainly fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, but the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that many threads sit on the run queues while only a few can actually run at any instant; the speed at which the system switches between them gives the impression that they are all executing at the same time.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the basic challenges is to create code that is stable and at the same time scalable. A further challenge in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. Ray Bryant and John Hawkes found the following bottlenecks when porting Linux to a 64-core NUMA system; each is an example of a type of bottleneck that can appear in any program.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this is identified, it can be moved to limit the problem. This kind of bottleneck can appear anywhere an expensive operation is called a needless number of times; it is not inherent, but the result of bad design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation&#039;&#039;&#039;. One example is the xtime_lock in Linux: holding the lock for reads prevented the timer value from being written, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This kind of bottleneck would appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the speed of an atomic action, the finer its granularity, and coarse-grained code eats a lot of CPU time where a finer-grained implementation would eat less. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more popular dnotify_parent() function, which was deemed an unacceptable state of affairs, so the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control: waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred fix, on Linux NUMA systems, was to limit the BKL&#039;s usage; the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of these examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems&#039;&#039;&#039; at once. One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and the needless, expensive computing and recomputing of information already in the cache. The scheduler also had O(n) time complexity, which meant it had scalability issues and would become inefficient beyond a particular number of processes. These problems were fixed by replacing it with a more efficient scheduler with O(1) time complexity, so that any number of threads or processes could be scheduled without growing overhead.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs that provide cheap threading on top of expensive processes. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes and behaviors such as race conditions, which require careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run under one scheduler – though this can turn out to be inefficient if many related threads end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level while coroutines execute at the language level. Unlike UMS threads, fibers cannot take advantage of multiprocessor machines, but they require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all higher-priority ones have finished.&lt;br /&gt;
&lt;br /&gt;
Thread pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks, they sleep. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to drop and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
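&lt;br /&gt;
A minimal pthreads sketch of this pattern (our own illustration, not from any cited source): long-lived workers sleep on a condition variable until tasks are queued, avoiding per-task thread creation and destruction.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Minimal fixed-size thread pool with a bounded task queue; a
 * production pool also needs shutdown handling, error checking,
 * and dynamic sizing. */
#define WORKERS 4
#define QSIZE   64

typedef void (*task_fn)(int);
typedef struct { task_fn fn; int arg; } task;

static task queue[QSIZE];
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

/* Hand a task to the pool; blocks while the queue is full. */
static void submit(task_fn fn, int arg) {
    pthread_mutex_lock(&amp;lock);
    while (count == QSIZE)
        pthread_cond_wait(&amp;not_full, &amp;lock);
    queue[tail] = (task){ fn, arg };
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&amp;not_empty);
    pthread_mutex_unlock(&amp;lock);
}

/* Each long-lived worker sleeps until a task is queued. */
static void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&amp;lock);
        while (count == 0)
            pthread_cond_wait(&amp;not_empty, &amp;lock);
        task t = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&amp;not_full);
        pthread_mutex_unlock(&amp;lock);
        t.fn(t.arg);  /* run the task outside the lock */
    }
    return NULL;
}

static void say(int n) { printf(&quot;task %d done\n&quot;, n); }

int main(void) {
    pthread_t tid[WORKERS];
    for (int i = 0; i &lt; WORKERS; i++)
        pthread_create(&amp;tid[i], NULL, worker, NULL);
    for (int n = 0; n &lt; 10; n++)
        submit(say, n);
    sleep(1);  /* crude: give the workers time to drain the queue */
    return 0;
}
&lt;/pre&gt;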
&lt;br /&gt;
== Design Choices == &lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; &lt;br /&gt;
This is the most basic design, built on the lightweight process. The 1:1 model boasts a slim, clean library interface on top of the kernel functions, with management and scheduling done through thread management. Although an M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, the cost of implementing it being too high, and this gave birth to the 1:1 model. Thread-aware operating systems include Windows 2000, Windows XP, Windows Vista, and essentially every recent operating system.&lt;br /&gt;
      &lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling for use with the multitude of signal masks. Since a signal is delivered to a thread only if it is unblocked there, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is the best recipient of a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Implementing synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. That method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization (a short code sketch follows the list):&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, restricting access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads may hold read access to a protected resource, but editing its contents requires the exclusive write lock, which can only be taken once all read locks are released&lt;br /&gt;
*Condition-variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads; a count tracks how many threads may have concurrent access to the data, and once the limit is reached other threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
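&lt;br /&gt;
The sketch below (ours) shows two of those four primitives in POSIX threads – a mutex guarding shared state and a condition variable blocking a thread until the condition becomes true:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready;

static void *producer(void *unused) {
    (void)unused;
    pthread_mutex_lock(&amp;m);    /* mutex: one thread at a time in here */
    ready = 1;
    pthread_cond_signal(&amp;c);   /* condition variable: wake the waiter */
    pthread_mutex_unlock(&amp;m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&amp;t, NULL, producer, NULL);
    pthread_mutex_lock(&amp;m);
    while (!ready)                  /* loop guards against spurious wakeups */
        pthread_cond_wait(&amp;c, &amp;m);
    pthread_mutex_unlock(&amp;m);
    puts(&quot;condition observed; proceeding&quot;);
    pthread_join(t, NULL);
    return 0;
}
&lt;/pre&gt;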
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management is an important design choice when attempting to create a large number of threads in a single process, from creation through maintenance to deallocation. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when the number of threads is large.&lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem that can arise with this creation process, since it frontloads the cost. Another problem is that the thread would normally have to allocate the memory for its stack at creation, because dynamically allocating stack memory is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
Locks are an important part of the performance of threads and there are multiple way of controlling and creating locks in order to create a large amount of threads. Single lock (having the data structures all be in one lock) has the advantage that once the processor has acquired the lock it can modify any of the stored data. Using the single lock method means only one lock is needed per thread, decreasing the thread overhead but this also limits the throughput of the system. Multiple lock (having each data structure have it&#039;s own lock) has the advantage of that each action on the data structure is it&#039;s own locking/unlocking operations. Multiple has greater thread overhead (because there are more locks) but the thread throughput is much higher allowing for fast creation of threads. Another downside of multiple lock systems are deadlocks, a deadlock happens when two different threads are waiting for data that the other task holds. Single and multiple lock systems are the inverse of each other and using both depending on the situation can greatly increase the performance of a system.   &lt;br /&gt;
	&lt;br /&gt;
The deallocation of a thread can also be optimized for use in increasing the scalability of threads. Storing deallocted stacks and control blocks in a free list allows the process of allocation and deallocation to be a list operation, if they are not stored in a free list then the thread overhead would include finding the correct size of free memory to store the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
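&lt;br /&gt;
A sketch of the free-list idea (the control-block layout here is hypothetical; a real implementation would also protect the list with a lock or keep per-CPU lists):&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 struct tcb {                /* hypothetical thread control block */&lt;br /&gt;
     struct tcb *next;       /* free-list link */&lt;br /&gt;
     void *stack;            /* allocated lazily at first run, then kept */&lt;br /&gt;
     void *arg;              /* arguments copied here at creation */&lt;br /&gt;
 };&lt;br /&gt;
 static struct tcb *free_list;&lt;br /&gt;
 struct tcb *tcb_alloc(void *arg)&lt;br /&gt;
 {&lt;br /&gt;
     struct tcb *t = free_list;&lt;br /&gt;
     if (t)&lt;br /&gt;
         free_list = t-&amp;gt;next;       /* reuse: allocation is a list pop */&lt;br /&gt;
     else&lt;br /&gt;
         t = calloc(1, sizeof *t);     /* slow path, taken rarely */&lt;br /&gt;
     t-&amp;gt;arg = arg;                   /* stack allocation is deferred */&lt;br /&gt;
     return t;&lt;br /&gt;
 }&lt;br /&gt;
 void tcb_free(struct tcb *t)&lt;br /&gt;
 {&lt;br /&gt;
     t-&amp;gt;next = free_list;            /* keep the stack for the next thread */&lt;br /&gt;
     free_list = t;&lt;br /&gt;
 }&lt;br /&gt;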
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, while Linux&#039;s CFS (Completely Fair Scheduler) instead orders threads in a red-black tree. All threads execute in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in userspace, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/], Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
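&lt;br /&gt;
A sketch of requesting an explicit priority and round-robin time slices with pthreads (SCHED_RR usually requires privileges; illustrative only):&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;sched.h&amp;gt;&lt;br /&gt;
 pthread_t start_rr_thread(void *(*fn)(void *), void *arg, int prio)&lt;br /&gt;
 {&lt;br /&gt;
     pthread_t tid;&lt;br /&gt;
     pthread_attr_t a;&lt;br /&gt;
     struct sched_param sp = { .sched_priority = prio };&lt;br /&gt;
     pthread_attr_init(&amp;amp;a);&lt;br /&gt;
     pthread_attr_setinheritsched(&amp;amp;a, PTHREAD_EXPLICIT_SCHED);&lt;br /&gt;
     pthread_attr_setschedpolicy(&amp;amp;a, SCHED_RR);   /* round-robin time slices */&lt;br /&gt;
     pthread_attr_setschedparam(&amp;amp;a, &amp;amp;sp);&lt;br /&gt;
     pthread_create(&amp;amp;tid, &amp;amp;a, fn, arg);&lt;br /&gt;
     pthread_attr_destroy(&amp;amp;a);&lt;br /&gt;
     return tid;&lt;br /&gt;
 }&lt;br /&gt;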
&lt;br /&gt;
== References ==&lt;br /&gt;
Linux Symposium, pg83 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&amp;lt;br&amp;gt;&lt;br /&gt;
PicoThreads: Lightweight Threads in Java [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4287</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4287"/>
		<updated>2010-10-15T00:39:06Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Scalable Threads: The Problems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== The Background ==&lt;br /&gt;
&lt;br /&gt;
A &#039;process&#039; is defined to be &amp;quot;an address-space and a group of resources dedicated to running the program&amp;quot;. A &#039;thread&#039;, on the other hand, is an independent sequential unit of computation that executes within the context of a kernel-supported entity like a &#039;process&#039;. Threads are often classified by their “weight” (or overhead), which corresponds to the amount of context that must be saved when a thread is removed from the processor and restored when a thread is reinstated on a processor (a context switch). The context for a process usually includes the hardware registers, kernel stack, user-level stack, interrupt vectors, page tables, and more. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start; therefore, there may exist millions of them in a single process. Loosely based on this, there are two major types of threads: kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped onto kernel threads and are lightweight. The ratio of user threads to kernel threads is an important factor when designing scalable systems.&lt;br /&gt;
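&lt;br /&gt;
As a rough sketch of why thread weight matters at this scale: with pthreads, the per-thread stack reservation is the dominant cost, and shrinking it is one way to push thread counts up (sizes below are illustrative):&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 void *worker(void *arg) { return arg; }&lt;br /&gt;
 /* A default stack often reserves about 8 MiB of address space per */&lt;br /&gt;
 /* thread; a small explicit stack lets far more threads coexist.   */&lt;br /&gt;
 int spawn_small(pthread_t *tid)&lt;br /&gt;
 {&lt;br /&gt;
     pthread_attr_t a;&lt;br /&gt;
     int rc;&lt;br /&gt;
     pthread_attr_init(&amp;amp;a);&lt;br /&gt;
     pthread_attr_setstacksize(&amp;amp;a, 64 * 1024);    /* 64 KiB */&lt;br /&gt;
     rc = pthread_create(tid, &amp;amp;a, worker, NULL);&lt;br /&gt;
     pthread_attr_destroy(&amp;amp;a);&lt;br /&gt;
     return rc;&lt;br /&gt;
 }&lt;br /&gt;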
&lt;br /&gt;
There are a few designs, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency results from multiple threads sitting on the run queues: the system cannot run them all at the same instant, but the speed at which it switches between them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the basic challenges is to create code which is stable and at the same time scalable. Furthermore, the challenge in making an existing code base scalable is the identification and elimination of bottlenecks once scaled. Ray Bryant and John Hawkes found the following bottlenecks when porting Linux to a 64-core NUMA system. Each of these bottlenecks is an example of a type of bottleneck that can appear in any program.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&lt;br /&gt;
&lt;br /&gt;
When expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;, one type of bottleneck appears. In Linux, misplaced information in the cache can cause a &amp;quot;cache-coherency operation&amp;quot; to be invoked. This operation is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the problem. This bottleneck can appear anywhere expensive operations are called a needless number of times (the problem is not inherent, but a result of bad design).&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck is from &#039;&#039;&#039;starvation.&#039;&#039;&#039; An example is the xtime_lock in Linux: read-locking prevented writing to the timer value, causing the kernel to waste CPU time retrying. This problem was solved by using a lockless read. It would appear anywhere that a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
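&lt;br /&gt;
A sketch of the lockless-read idea (a seqlock-style retry loop in portable C11; the actual kernel code differs in detail):&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;&lt;br /&gt;
 static atomic_uint seq;           /* odd while a write is in progress */&lt;br /&gt;
 static unsigned long time_val;&lt;br /&gt;
 void write_time(unsigned long now)       /* single writer */&lt;br /&gt;
 {&lt;br /&gt;
     atomic_fetch_add(&amp;amp;seq, 1);        /* seq goes odd: write begins */&lt;br /&gt;
     time_val = now;&lt;br /&gt;
     atomic_fetch_add(&amp;amp;seq, 1);        /* seq even again: write done */&lt;br /&gt;
 }&lt;br /&gt;
 unsigned long read_time(void)            /* readers never block the writer */&lt;br /&gt;
 {&lt;br /&gt;
     unsigned s;&lt;br /&gt;
     unsigned long v;&lt;br /&gt;
     do {&lt;br /&gt;
         s = atomic_load(&amp;amp;seq);&lt;br /&gt;
         v = time_val;&lt;br /&gt;
     } while ((s &amp;amp; 1) || s != atomic_load(&amp;amp;seq));  /* retry torn reads */&lt;br /&gt;
     return v;&lt;br /&gt;
 }&lt;br /&gt;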
&lt;br /&gt;
The next type of bottleneck is from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the speed of an atomic action, the finer its granularity. Both examples below eat a lot of CPU time where a finer-grained implementation would eat less. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function. This was deemed an unacceptable state of affairs, so the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage; the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of these examples are the result of coarse granularity. &lt;br /&gt;
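&lt;br /&gt;
A sketch of the finer-grained alternative: rather than one lock serializing a whole hash table, give each bucket its own lock so unrelated operations stop contending (names are illustrative):&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #define NBUCKETS 64&lt;br /&gt;
 struct entry;                         /* element type left abstract */&lt;br /&gt;
 static struct bucket {&lt;br /&gt;
     pthread_mutex_t lock;             /* one lock per bucket, not per table */&lt;br /&gt;
     struct entry *head;&lt;br /&gt;
 } table[NBUCKETS];&lt;br /&gt;
 void table_init(void)&lt;br /&gt;
 {&lt;br /&gt;
     for (int i = 0; i &amp;lt; NBUCKETS; i++)&lt;br /&gt;
         pthread_mutex_init(&amp;amp;table[i].lock, NULL);&lt;br /&gt;
 }&lt;br /&gt;
 void with_bucket(unsigned hash, void (*op)(struct entry **))&lt;br /&gt;
 {&lt;br /&gt;
     struct bucket *b = &amp;amp;table[hash % NBUCKETS];&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;b-&amp;gt;lock);     /* held briefly, per bucket */&lt;br /&gt;
     op(&amp;amp;b-&amp;gt;head);&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;b-&amp;gt;lock);&lt;br /&gt;
 }&lt;br /&gt;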
&lt;br /&gt;
Bottlenecks can also come from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems: its spinlock, which was coarse-grained, ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache, a needlessly expensive operation. The scheduler also had O(n) time complexity, meaning it scaled poorly and became inefficient beyond a certain number of processes. These problems were fixed by replacing the scheduler with the more efficient O(1) scheduler, whose per-decision cost does not grow with the number of threads/processes. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs that provide cheap threading alongside expensive processes. UMS, which reflects such a design, is a recommended mechanism for high-performance workloads that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviours such as race conditions, which require careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, creating affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes. &lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers are a system-level facility while coroutines exist at the language level. Unlike UMS, fibers cannot exploit multiprocessor machines; however, they require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished. &lt;br /&gt;
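&lt;br /&gt;
A sketch of the fiber/coroutine mechanism using POSIX ucontext, the portable analogue (Windows fibers and Symbian active objects expose different APIs):&lt;br /&gt;
 #include &amp;lt;ucontext.h&amp;gt;&lt;br /&gt;
 static ucontext_t main_ctx, fib_ctx;&lt;br /&gt;
 static char fib_stack[64 * 1024];&lt;br /&gt;
 static int step;&lt;br /&gt;
 static void fiber_fn(void)&lt;br /&gt;
 {&lt;br /&gt;
     step = 1;&lt;br /&gt;
     swapcontext(&amp;amp;fib_ctx, &amp;amp;main_ctx);  /* yield, keeping our stack alive */&lt;br /&gt;
     step = 2;                              /* resumed exactly where we left off */&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     getcontext(&amp;amp;fib_ctx);&lt;br /&gt;
     fib_ctx.uc_stack.ss_sp = fib_stack;&lt;br /&gt;
     fib_ctx.uc_stack.ss_size = sizeof fib_stack;&lt;br /&gt;
     fib_ctx.uc_link = &amp;amp;main_ctx;        /* where to go when the fiber returns */&lt;br /&gt;
     makecontext(&amp;amp;fib_ctx, fiber_fn, 0);&lt;br /&gt;
     swapcontext(&amp;amp;main_ctx, &amp;amp;fib_ctx);  /* run the fiber until it yields */&lt;br /&gt;
     swapcontext(&amp;amp;main_ctx, &amp;amp;fib_ctx);  /* resume it to completion */&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;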
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines within a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Often, thread pools operate on server farms, and therefore thread safety has to be carefully implemented.&lt;br /&gt;
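&lt;br /&gt;
A minimal fixed-pool sketch in pthreads: long-lived workers sleep on a condition variable and pull tasks from a queue, so no thread is created or destroyed per task (shutdown and error handling omitted):&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 struct task { void (*fn)(void *); void *arg; struct task *next; };&lt;br /&gt;
 static struct task *head, *tail;&lt;br /&gt;
 static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;&lt;br /&gt;
 static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;&lt;br /&gt;
 void submit(struct task *t)              /* producer side */&lt;br /&gt;
 {&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;qlock);&lt;br /&gt;
     t-&amp;gt;next = NULL;&lt;br /&gt;
     if (tail) tail-&amp;gt;next = t; else head = t;&lt;br /&gt;
     tail = t;&lt;br /&gt;
     pthread_cond_signal(&amp;amp;qcond);       /* wake one sleeping worker */&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;qlock);&lt;br /&gt;
 }&lt;br /&gt;
 void *worker(void *arg)                  /* each pool thread runs this */&lt;br /&gt;
 {&lt;br /&gt;
     for (;;) {&lt;br /&gt;
         pthread_mutex_lock(&amp;amp;qlock);&lt;br /&gt;
         while (!head)&lt;br /&gt;
             pthread_cond_wait(&amp;amp;qcond, &amp;amp;qlock);  /* sleep, do not spin */&lt;br /&gt;
         struct task *t = head;&lt;br /&gt;
         head = t-&amp;gt;next;&lt;br /&gt;
         if (!head) tail = NULL;&lt;br /&gt;
         pthread_mutex_unlock(&amp;amp;qlock);&lt;br /&gt;
         t-&amp;gt;fn(t-&amp;gt;arg);                /* run the task outside the lock */&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;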
&lt;br /&gt;
== Design Choices == &lt;br /&gt;
--[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; &lt;br /&gt;
This is the most basic design choice: the 1:1 model maps each user thread onto a kernel-scheduled lightweight process. It boasts a slim, clean library interface on top of the kernel functions, with thread management and scheduling left to the kernel. Although the M:N model would require a complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model. Thread-aware operating systems include Windows 2000, Windows XP, Windows Vista and later releases.&lt;br /&gt;
      &lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling, including the per-thread signal masks. Since a signal is only delivered to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, granting it exclusive access to a particular section of code&lt;br /&gt;
*Read/write synchronization allows many concurrent readers of a protected resource, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*A condition variable blocks a thread until a given condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads: its count tracks how many threads may access the data concurrently, and once that limit is reached further threads are blocked until a slot is released.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management is an important design choice when attempting to create a large number of threads in a single process, covering creation, maintenance and deallocation. A thread&#039;s data structure is made up of a program counter, a stack and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when the number of threads is large. &lt;br /&gt;
&lt;br /&gt;
The creation of a thread can take place before the process actually requires it to run; the thread then waits until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation process, since it frontloads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block: this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts running) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
Locks are an important part of thread performance, and there are multiple ways of controlling and creating locks when the goal is a large number of threads. A single lock (placing all the data structures under one lock) has the advantage that once a processor has acquired the lock it can modify any of the stored data; only one lock is needed, decreasing thread overhead, but this also limits the throughput of the system. Multiple locks (giving each data structure its own lock) have the advantage that each action on a data structure is its own locking/unlocking operation. Multiple locks carry greater thread overhead (because there are more locks), but thread throughput is much higher, allowing for fast creation of threads. Another downside of multiple-lock systems is deadlock: a deadlock happens when two threads each wait for data that the other holds. Single-lock and multiple-lock systems are the inverse of each other, and choosing between them depending on the situation can greatly increase the performance of a system.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, while Linux&#039;s CFS (Completely Fair Scheduler) instead orders threads in a red-black tree. All threads execute in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in userspace, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/], Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
Linux Symposium, pg83 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&amp;lt;br&amp;gt;&lt;br /&gt;
PicoThreads: Lightweight Threads in Java [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4286</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=4286"/>
		<updated>2010-10-15T00:38:27Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* The Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== The Background ==&lt;br /&gt;
&lt;br /&gt;
A &#039;process&#039; is defined to be &amp;quot;an address-space and a group of resources dedicated to running the program&amp;quot;. A &#039;thread&#039;, on the other hand, is an independent sequential unit of computation that executes within the context of a kernel-supported entity like a &#039;process&#039;. Threads are often classified by their “weight” (or overhead), which corresponds to the amount of context that must be saved when a thread is removed from the processor and restored when a thread is reinstated on a processor (a context switch). The context for a process usually includes the hardware registers, kernel stack, user-level stack, interrupt vectors, page tables, and more. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start; therefore, there may exist millions of them in a single process. Loosely based on this, there are two major types of threads: kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped onto kernel threads and are lightweight. The ratio of user threads to kernel threads is an important factor when designing scalable systems.&lt;br /&gt;
&lt;br /&gt;
There are a few designs, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency results from multiple threads sitting on the run queues: the system cannot run them all at the same instant, but the speed at which it switches between them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the basic challenges is to create code which is stable and at the same time scalable. Furthermore, the challenge in making an existing code base scalable is the identification and elimination of bottlenecks once scaled. Ray Bryant and John Hawkes found the following bottlenecks when porting Linux to a 64-core NUMA system. Each of these bottlenecks is an example of a type of bottleneck that can appear in any program.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&lt;br /&gt;
&lt;br /&gt;
When expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;, one type of bottleneck appears. In Linux, misplaced information in the cache can cause a &amp;quot;cache-coherency operation&amp;quot; to be invoked. This operation is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the problem. This bottleneck can appear anywhere expensive operations are called a needless number of times (the problem is not inherent, but a result of bad design).&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck is from &#039;&#039;&#039;starvation.&#039;&#039;&#039; An example is the xtime_lock in Linux: read-locking prevented writing to the timer value, causing the kernel to waste CPU time retrying. This problem was solved by using a lockless read. It would appear anywhere that a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck is from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the speed of an atomic action, the finer its granularity. Both examples below eat a lot of CPU time where a finer-grained implementation would eat less. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function. This was deemed an unacceptable state of affairs, so the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage; the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of these examples are the result of coarse granularity. &lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also come from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems: its spinlock, which was coarse-grained, ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache, a needlessly expensive operation. The scheduler also had O(n) time complexity, meaning it scaled poorly and became inefficient beyond a certain number of processes. These problems were fixed by replacing the scheduler with the more efficient O(1) scheduler, whose per-decision cost does not grow with the number of threads/processes. &lt;br /&gt;
--[[Rannath]]  A few additions--[[Gautam]] --Cache-coherency is not the important part --[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs that provide cheap threading alongside expensive processes. UMS, which reflects such a design, is a recommended mechanism for high-performance workloads that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviours such as race conditions, which require careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, creating affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes. &lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers are a system-level facility while coroutines exist at the language level. Unlike UMS, fibers cannot exploit multiprocessor machines; however, they require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished. &lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines within a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Often, thread pools operate on server farms, and therefore thread safety has to be carefully implemented.&lt;br /&gt;
&lt;br /&gt;
== Design Choices == &lt;br /&gt;
--[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; &lt;br /&gt;
This is the most basic design choice: the 1:1 model maps each user thread onto a kernel-scheduled lightweight process. It boasts a slim, clean library interface on top of the kernel functions, with thread management and scheduling left to the kernel. Although the M:N model would require a complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model. Thread-aware operating systems include Windows 2000, Windows XP, Windows Vista and later releases.&lt;br /&gt;
      &lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling, including the per-thread signal masks. Since a signal is only delivered to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, granting it exclusive access to a particular section of code&lt;br /&gt;
*Read/write synchronization allows many concurrent readers of a protected resource, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*A condition variable blocks a thread until a given condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads: its count tracks how many threads may access the data concurrently, and once that limit is reached further threads are blocked until a slot is released.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management is an important design choice when attempting to create a large number of threads in a single process, covering creation, maintenance and deallocation. A thread&#039;s data structure is made up of a program counter, a stack and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when the number of threads is large. &lt;br /&gt;
&lt;br /&gt;
The creation of a thread can take place before the process actually requires it to run; the thread then waits until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation process, since it frontloads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block: this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts running) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
Locks are an important part of thread performance, and there are multiple ways of controlling and creating locks when the goal is a large number of threads. A single lock (placing all the data structures under one lock) has the advantage that once a processor has acquired the lock it can modify any of the stored data; only one lock is needed, decreasing thread overhead, but this also limits the throughput of the system. Multiple locks (giving each data structure its own lock) have the advantage that each action on a data structure is its own locking/unlocking operation. Multiple locks carry greater thread overhead (because there are more locks), but thread throughput is much higher, allowing for fast creation of threads. Another downside of multiple-lock systems is deadlock: a deadlock happens when two threads each wait for data that the other holds. Single-lock and multiple-lock systems are the inverse of each other, and choosing between them depending on the situation can greatly increase the performance of a system.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, while Linux&#039;s CFS (Completely Fair Scheduler) instead orders threads in a red-black tree. All threads execute in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in userspace, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/], Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
Linux Symposium, pg83 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83]&amp;lt;br&amp;gt;&lt;br /&gt;
PicoThreads: Lightweight Threads in Java [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3935</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3935"/>
		<updated>2010-10-14T18:44:07Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Answer */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is known as an instance of a program running on a computer, with its own resources such as an address space, files and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute either the same code or different code within the same application, because it has its own state, run-time stack and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start; therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. There are a few designs that incorporate this, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency results from multiple threads sitting on the run queues: the system cannot run them all at the same instant, but the speed at which it switches between them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]] &amp;amp;&amp;amp; [[Shane]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. &#039;&#039;&#039; and &#039;&#039;&#039;i have added my version to pauls and modified it [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the basic challenges is to create code which is stable and at the same time scalable. Furthermore, the challenge in making an existing code base scalable is the identification and elimination of bottlenecks once scaled. Ray Bryant and John Hawkes found the following bottlenecks when porting Linux to a 64-core NUMA system. Each of these bottlenecks is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
When expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;, one type of bottleneck appears. In Linux, misplaced information in the cache can cause a &amp;quot;&#039;&#039;&#039;cache-coherency operation&#039;&#039;&#039;&amp;quot; to be invoked. This operation is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the problem. This bottleneck can appear anywhere expensive operations are called a needless number of times (the problem is not inherent, but a result of bad design).&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck is from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck is the xtime_lock in Linux: read-locking prevented writing to the timer value, causing the kernel to waste CPU time retrying. This problem was solved by using a lockless read. It would appear anywhere that a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck is from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the speed of an atomic action, the finer its granularity. Both examples below eat a lot of CPU time where a finer-grained implementation would eat less. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function. That was an unacceptable state of affairs, so the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage; the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of these examples are the result of coarse granularity. &lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also come from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems: its spinlock, which was coarse-grained, ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache, a needlessly expensive operation. These problems were fixed by replacing the scheduler (that scheduler was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs that provide cheap threading alongside expensive processes. UMS, which reflects such a design, is a recommended mechanism for high-performance workloads that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviours such as race conditions, which require careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, creating affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes. &lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers are a system-level facility while coroutines exist at the language level. Unlike UMS, fibers cannot exploit multiprocessor machines; however, they require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished. &lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines within a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Often, thread pools operate on server farms, and therefore thread safety has to be carefully implemented.&lt;br /&gt;
&lt;br /&gt;
== Design Choices == &lt;br /&gt;
--[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; &lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling, including the per-thread signal masks. Since a signal is only delivered to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, granting it exclusive access to a particular section of code&lt;br /&gt;
*Read/write synchronization allows many concurrent readers of a protected resource, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*A condition variable blocks a thread until a given condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads: its count tracks how many threads may access the data concurrently, and once that limit is reached further threads are blocked until a slot is released.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management is an important design choice when attempting to create a large number of threads in a single process, covering creation, maintenance and deallocation. A thread&#039;s data structure is made up of a program counter, a stack and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when the number of threads is large. &lt;br /&gt;
&lt;br /&gt;
The creation of a thread can take place before the process actually requires it to run; the thread then waits until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation process, since it frontloads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block: this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts running) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, while Linux&#039;s CFS (Completely Fair Scheduler) instead orders threads in a red-black tree. All threads execute in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in userspace, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/], Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3847</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3847"/>
		<updated>2010-10-14T16:25:32Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is known as an instance of a program running on a computer, with its own resources such as an address space, files and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute either the same code or different code within the same application, because it has its own state, run-time stack and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start; therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. There are a few designs that incorporate this, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency results from multiple threads sitting on the run queues: the system cannot run them all at the same instant, but the speed at which it switches between them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. &#039;&#039;&#039; and &#039;&#039;&#039;i have added my version to pauls and modified it [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled up. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes documented the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to an access that finds the information in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the cost. This bottleneck can appear anywhere expensive operations are called a needless number of times; it is not inherent, but the result of poor design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: locking for reading prevented writes to the timer value, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This bottleneck would appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
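&lt;br /&gt;
To illustrate the lockless read (a sketch only; the identifiers below are ours, not the actual kernel code): a sequence counter lets readers detect a concurrent write and retry, so readers never block the writer.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdatomic.h&amp;gt;

/* Sketch of a sequence-counter (seqlock-style) lockless read;
   memory ordering is simplified for clarity. */
static atomic_uint seq;          /* even = stable, odd = write in progress */
static unsigned long time_val;   /* the protected timer value */

void timer_write(unsigned long v) {
    atomic_fetch_add(&amp;amp;seq, 1);  /* seq becomes odd: write begins */
    time_val = v;
    atomic_fetch_add(&amp;amp;seq, 1);  /* seq becomes even: write done */
}

unsigned long timer_read(void) {
    unsigned s;
    unsigned long v;
    do {                           /* retry until a stable snapshot */
        s = atomic_load(&amp;amp;seq);
        v = time_val;
    } while ((s &amp;amp; 1u) != 0 || atomic_load(&amp;amp;seq) != s);
    return v;
}
&lt;/pre&gt;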
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a coarse-grained implementation eats far more CPU time than a finer-grained one would. One coarse-grained bottleneck was the dcache_lock. It consumed some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another large coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred remedy, on Linux NUMA systems, was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and the needless recomputation of information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was in turn replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and under debate&lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought a design that makes threading cheap while processes stay expensive. UMS, which reflects this design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the desired utility: for instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run under one scheduler. This can turn out to be inefficient if many related threads end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level while coroutines execute at the language level. Unlike UMS, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above have finished.&lt;br /&gt;
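&lt;br /&gt;
As a rough illustration of fiber-style switching, here is a minimal sketch using the POSIX ucontext API (the function names, output strings, and stack size are invented for the example); control passes back and forth between the two contexts without the kernel scheduler deciding who runs next.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;ucontext.h&amp;gt;

/* Minimal fiber demo with the POSIX ucontext API; illustrative only. */
static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];

static void fiber_body(void) {
    printf(&amp;quot;fiber: first run\n&amp;quot;);
    swapcontext(&amp;amp;fiber_ctx, &amp;amp;main_ctx);  /* yield back to main */
    printf(&amp;quot;fiber: resumed\n&amp;quot;);
}   /* returning here resumes uc_link, i.e. main_ctx */

int main(void) {
    getcontext(&amp;amp;fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &amp;amp;main_ctx;
    makecontext(&amp;amp;fiber_ctx, fiber_body, 0);

    swapcontext(&amp;amp;main_ctx, &amp;amp;fiber_ctx);  /* run fiber until it yields */
    swapcontext(&amp;amp;main_ctx, &amp;amp;fiber_ctx);  /* resume fiber to the end */
    return 0;
}
&lt;/pre&gt;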
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks to complete, they sleep. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
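&lt;br /&gt;
A minimal sketch of the pattern using POSIX threads (the fixed-size ring buffer and the pool_* names are ours, purely illustrative): workers sleep on a condition variable until a task is enqueued, so no thread is created or destroyed per task.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

#define QCAP 64

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task_t;

static task_t queue[QCAP];
static int head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Enqueue a task; this sketch silently drops work if the ring is full. */
void pool_submit(task_fn fn, void *arg) {
    pthread_mutex_lock(&amp;amp;qlock);
    if (count &amp;lt; QCAP) {
        queue[tail] = (task_t){ fn, arg };
        tail = (tail + 1) % QCAP;
        count++;
        pthread_cond_signal(&amp;amp;nonempty); /* wake one sleeping worker */
    }
    pthread_mutex_unlock(&amp;amp;qlock);
}

/* Each worker loops forever: sleep until work exists, then run it. */
static void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&amp;amp;qlock);
        while (count == 0)
            pthread_cond_wait(&amp;amp;nonempty, &amp;amp;qlock); /* sleeps, no busy wait */
        task_t t = queue[head];
        head = (head + 1) % QCAP;
        count--;
        pthread_mutex_unlock(&amp;amp;qlock);
        t.fn(t.arg);               /* run the task outside the lock */
    }
    return 0;
}

/* Start the long-lived workers once; they are reused for every task. */
void pool_start(int nthreads) {
    for (int i = 0; i &amp;lt; nthreads; i++) {
        pthread_t tid;
        pthread_create(&amp;amp;tid, 0, worker, 0);
    }
}
&lt;/pre&gt;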
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model offers a slim, clean library interface on top of the kernel functions. The M:N model would require a more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B) Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and copes with the multitude of signal masks. Since a signal is delivered to a thread only if the signal is unblocked in that thread, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is the best recipient of a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
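&lt;br /&gt;
For illustration, a common POSIX idiom that plays to this design (a sketch; the thread function name is ours): every thread blocks the signal, and one dedicated thread consumes it synchronously with sigwait(), so the kernel always has exactly one sensible recipient.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;signal.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Dedicated signal-handling thread: waits synchronously for SIGTERM. */
static void *sig_thread(void *arg) {
    sigset_t *set = arg;
    int sig;
    sigwait(set, &amp;amp;sig);           /* blocks until SIGTERM arrives */
    printf(&amp;quot;received signal %d\n&amp;quot;, sig);
    return 0;
}

int main(void) {
    sigset_t set;
    pthread_t tid;

    sigemptyset(&amp;amp;set);
    sigaddset(&amp;amp;set, SIGTERM);
    /* Block SIGTERM here; threads created afterwards inherit the
       mask, so only the sigwait() call can consume the signal. */
    pthread_sigmask(SIG_BLOCK, &amp;amp;set, 0);

    pthread_create(&amp;amp;tid, 0, sig_thread, &amp;amp;set);
    pthread_join(tid, 0);
    return 0;
}
&lt;/pre&gt;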
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C) Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. In the old implementation, signals were the only viable solution: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by the degradation of signal handling quality in the application. Fortunately, new functionality has since been added to the kernel to implement every kind of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization (a usage sketch follows the list):&lt;br /&gt;
&lt;br /&gt;
*Mutex locks admit only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*Read/write synchronization lets many threads share read access to a protected resource, but editing its content requires the exclusive write lock, which can only be acquired once all read locks are released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a given condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count keeps track of how many threads may access the data concurrently, and once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
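&lt;br /&gt;
As the promised sketch (the csem_* names are ours, purely illustrative), here is a counting semaphore built from a mutex and a condition variable, showing how these primitives compose:&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

/* Illustrative counting semaphore built from a mutex and a condition
   variable; available plays the role of the semaphore count. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  freed;
    int             available;   /* how many threads may still enter */
} csem_t;

void csem_init(csem_t *s, int limit) {
    pthread_mutex_init(&amp;amp;s-&amp;gt;lock, 0);
    pthread_cond_init(&amp;amp;s-&amp;gt;freed, 0);
    s-&amp;gt;available = limit;
}

void csem_acquire(csem_t *s) {
    pthread_mutex_lock(&amp;amp;s-&amp;gt;lock);
    while (s-&amp;gt;available == 0)          /* wait until a slot frees up */
        pthread_cond_wait(&amp;amp;s-&amp;gt;freed, &amp;amp;s-&amp;gt;lock);
    s-&amp;gt;available--;
    pthread_mutex_unlock(&amp;amp;s-&amp;gt;lock);
}

void csem_release(csem_t *s) {
    pthread_mutex_lock(&amp;amp;s-&amp;gt;lock);
    s-&amp;gt;available++;
    pthread_cond_signal(&amp;amp;s-&amp;gt;freed);   /* wake one blocked thread */
    pthread_mutex_unlock(&amp;amp;s-&amp;gt;lock);
}
&lt;/pre&gt;
&lt;br /&gt;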
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D) Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state. Optimizing this data structure can greatly increase performance when threads are numerous.&lt;br /&gt;
&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation scheme, since it front-loads the cost. Another problem is that the thread must allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into list operations; without a free list, the thread overhead would include searching for a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
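&lt;br /&gt;
A sketch of that free-list idea (the tcb fields and the fixed stack size are invented for illustration, and a real implementation would also need locking or per-CPU lists): recycling control blocks and stacks turns allocation into a constant-time pop.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdlib.h&amp;gt;

#define STACK_SIZE (64 * 1024)    /* illustrative fixed stack size */

/* Hypothetical thread control block; a real TCB also holds saved
   registers, scheduling state, and so on. */
typedef struct tcb {
    struct tcb *next;             /* links free TCBs together */
    void       *stack;
    void       *arg;              /* argument copied in at creation */
} tcb_t;

static tcb_t *free_list;          /* deallocated TCBs, stacks attached */

tcb_t *tcb_alloc(void) {
    if (free_list) {              /* fast path: constant-time pop */
        tcb_t *t = free_list;
        free_list = t-&amp;gt;next;
        return t;
    }
    tcb_t *t = malloc(sizeof *t); /* slow path: really allocate */
    t-&amp;gt;stack = malloc(STACK_SIZE);
    return t;
}

void tcb_free(tcb_t *t) {         /* constant-time push; stack kept */
    t-&amp;gt;next = free_list;
    free_list = t;
}
&lt;/pre&gt;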
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, whereas Linux&#039;s CFS (Completely Fair Scheduler) orders threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is composed of its thread context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/], Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
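&lt;br /&gt;
For illustration, here is how an explicit real-time priority can be requested for a POSIX thread (a sketch; the helper name is ours, and whether the request is honored depends on privileges and platform):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sched.h&amp;gt;

/* Create a thread with an explicit SCHED_FIFO priority. FIFO threads
   preempt lower-priority ones, matching the strict-priority behavior
   described above. */
int spawn_rt_thread(void *(*fn)(void *), void *arg, int prio) {
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };
    pthread_t tid;

    pthread_attr_init(&amp;amp;attr);
    pthread_attr_setschedpolicy(&amp;amp;attr, SCHED_FIFO);
    pthread_attr_setschedparam(&amp;amp;attr, &amp;amp;sp);
    /* Without this, the attribute priorities are ignored and the new
       thread simply inherits the creating thread&#039;s scheduling. */
    pthread_attr_setinheritsched(&amp;amp;attr, PTHREAD_EXPLICIT_SCHED);

    return pthread_create(&amp;amp;tid, &amp;amp;attr, fn, arg);
}
&lt;/pre&gt;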
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3846</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3846"/>
		<updated>2010-10-14T16:19:16Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, open files, and I/O devices. A thread, by contrast, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute the same code as other threads or different code within the same application, since it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs built on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs incorporate this idea, mainly Fibers and UMS (User Mode Scheduling), and they allow for very high scalability. UMS threads have their own context and resources, and their ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, yet another mechanism that allows for high scalability. Systems support millions of threads within a single process by switching execution resources between threads, creating concurrent execution: the waiting threads sit on queues and never actually run at the same instant, but the speed at which the system switches among them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled up. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes documented the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to an access that finds the information in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the cost. This bottleneck can appear anywhere expensive operations are called a needless number of times; it is not inherent, but the result of poor design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: locking for reading prevented writes to the timer value, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This bottleneck would appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a coarse-grained implementation eats far more CPU time than a finer-grained one would. One coarse-grained bottleneck was the dcache_lock. It consumed some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another large coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred remedy, on Linux NUMA systems, was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and the needless recomputation of information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was in turn replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and under debate&lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought a design that makes threading cheap while processes stay expensive. UMS, which reflects this design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the desired utility: for instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run under one scheduler. This can turn out to be inefficient if many related threads end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level while coroutines execute at the language level. Unlike UMS, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above have finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks to complete, they sleep. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model offers a slim, clean library interface on top of the kernel functions. The M:N model would require a more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B) Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and copes with the multitude of signal masks. Since a signal is delivered to a thread only if the signal is unblocked in that thread, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is the best recipient of a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C) Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. In the old implementation, signals were the only viable solution: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by the degradation of signal handling quality in the application. Fortunately, new functionality has since been added to the kernel to implement every kind of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*Mutex locks admit only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*Read/write synchronization lets many threads share read access to a protected resource, but editing its content requires the exclusive write lock, which can only be acquired once all read locks are released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a given condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count keeps track of how many threads may access the data concurrently, and once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D) Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state. Optimizing this data structure can greatly increase performance when threads are numerous.&lt;br /&gt;
&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation scheme, since it front-loads the cost. Another problem is that the thread must allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into list operations; without a free list, the thread overhead would include searching for a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, whereas Linux&#039;s CFS (Completely Fair Scheduler) orders threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is composed of its thread context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3845</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3845"/>
		<updated>2010-10-14T16:18:50Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, open files, and I/O devices. A thread, by contrast, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute the same code as other threads or different code within the same application, since it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs built on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs incorporate this idea, mainly Fibers and UMS (User Mode Scheduling), and they allow for very high scalability. UMS threads have their own context and resources, and their ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, yet another mechanism that allows for high scalability. Systems support millions of threads within a single process by switching execution resources between threads, creating concurrent execution: the waiting threads sit on queues and never actually run at the same instant, but the speed at which the system switches among them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled up. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes documented the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to an access that finds the information in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the cost. This bottleneck can appear anywhere expensive operations are called a needless number of times; it is not inherent, but the result of poor design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: locking for reading prevented writes to the timer value, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This bottleneck would appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a coarse-grained implementation eats far more CPU time than a finer-grained one would. One coarse-grained bottleneck was the dcache_lock. It consumed some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another large coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred remedy, on Linux NUMA systems, was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and the needless recomputation of information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was in turn replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and under debate&lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought a design that makes threading cheap while processes stay expensive. UMS, which reflects this design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the desired utility: for instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run under one scheduler. This can turn out to be inefficient if many related threads end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level while coroutines execute at the language level. Unlike UMS, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above have finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks to complete, they sleep. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model offers a slim, clean library interface on top of the kernel functions. The M:N model would require a more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B) Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and copes with the multitude of signal masks. Since a signal is delivered to a thread only if the signal is unblocked in that thread, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is the best recipient of a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C) Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. In the old implementation, signals were the only viable solution: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by the degradation of signal handling quality in the application. Fortunately, new functionality has since been added to the kernel to implement every kind of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*Mutex locks admit only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*Read/write synchronization lets many threads share read access to a protected resource, but editing its content requires the exclusive write lock, which can only be acquired once all read locks are released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a given condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count keeps track of how many threads may access the data concurrently, and once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D) Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state. Optimizing this data structure can greatly increase performance when threads are numerous.&lt;br /&gt;
&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation scheme, since it front-loads the cost. Another problem is that the thread must allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into list operations; without a free list, the thread overhead would include searching for a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, whereas Linux&#039;s CFS (Completely Fair Scheduler) orders threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is composed of its thread context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
{{refs|2}}&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3844</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3844"/>
		<updated>2010-10-14T16:16:59Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, open files, and I/O devices. A thread, by contrast, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute the same code as other threads or different code within the same application, since it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs built on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs incorporate this idea, mainly Fibers and UMS (User Mode Scheduling), and they allow for very high scalability. UMS threads have their own context and resources, and their ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, yet another mechanism that allows for high scalability. Systems support millions of threads within a single process by switching execution resources between threads, creating concurrent execution: the waiting threads sit on queues and never actually run at the same instant, but the speed at which the system switches among them gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled up. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes documented the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to an access that finds the information in the &#039;right place&#039;. Once the misplaced information that repeatedly causes this problem is identified, it can be moved to limit the cost. This bottleneck can appear anywhere expensive operations are called a needless number of times; it is not inherent, but the result of poor design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: locking for reading prevented writes to the timer value, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This bottleneck would appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a coarse-grained implementation eats far more CPU time than a finer-grained one would. One coarse-grained bottleneck was the dcache_lock. It consumed some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another large coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred remedy, on Linux NUMA systems, was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and the needless recomputation of information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was in turn replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and under debate&lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought a design that makes threading cheap while processes stay expensive. UMS, which reflects this design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the desired utility: for instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run under one scheduler. This can turn out to be inefficient if many related threads end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level while coroutines execute at the language level. Unlike UMS, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above have finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks to complete, they sleep. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model offers a slim, clean library interface on top of the kernel functions. The M:N model would require a more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel, given the high cost of implementing it; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B) Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and copes with the multitude of signal masks. Since a signal is delivered to a thread only if the signal is unblocked in that thread, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is the best recipient of a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C) Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. In the old implementation, signals were the only viable solution: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by the degradation of signal handling quality in the application. Fortunately, new functionality has since been added to the kernel to implement every kind of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*Mutex locks admit only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*Read/write synchronization lets many threads share read access to a protected resource, but editing its content requires the exclusive write lock, which can only be acquired once all read locks are released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a given condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count keeps track of how many threads may access the data concurrently, and once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D) Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state. Optimizing this data structure can greatly increase performance when threads are numerous.&lt;br /&gt;
&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation scheme, since it front-loads the cost. Another problem is that the thread must allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created. When the thread enters startup, it copies its arguments out of its control block and allocates its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
&lt;br /&gt;
The deallocation of a thread can also be optimized to increase scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into list operations; without a free list, the thread overhead would include searching for a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: a number ranging from 0 to 31 on Windows, whereas Linux&#039;s CFS (Completely Fair Scheduler) orders threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is composed of its thread context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3843</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3843"/>
		<updated>2010-10-14T16:16:05Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of a single process, sharing data with them; it can execute either the same code as the other threads or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs that rely on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs build on this idea, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that multiple threads stay on the queues without actually running at the same time; the speed at which they are switched gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
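&lt;br /&gt;
To make the distinction concrete, here is a minimal sketch (not from the essay sources) using POSIX threads: every thread created below shares the process&#039;s address space and global data, but each one gets its own stack and execution context.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

int shared = 0;  /* one copy, visible to every thread in the process */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;            /* locals live on the stack of this thread */
    pthread_mutex_lock(&amp;amp;lock);
    shared += (int)id;              /* shared state needs synchronization */
    pthread_mutex_unlock(&amp;amp;lock);
    return NULL;
}

int main(void) {
    pthread_t tid[8];
    for (long i = 0; i &amp;lt; 8; i++)
        pthread_create(&amp;amp;tid[i], NULL, worker, (void *)i);
    for (int i = 0; i &amp;lt; 8; i++)
        pthread_join(tid[i], NULL);
    printf(&amp;quot;shared = %d\n&amp;quot;, shared);
    return 0;
}
&lt;/pre&gt;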
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that keeps causing the problem is identified, it can be moved to limit it. This bottleneck can appear anywhere expensive operations are called a needless number of times; the problem is not inherent, but the result of bad design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: because reads held the lock, writes to the timer value were blocked, and the kernel wasted CPU time retrying them. The problem was solved by using a lockless read. This bottleneck can appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a finer-grained implementation eats less CPU time than a coarse-grained one. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores; the preferred method on Linux NUMA systems was to limit the BKL&#039;s usage, so the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of those examples are the result of coarse granularity.&lt;br /&gt;
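&lt;br /&gt;
To illustrate granularity (a sketch, not code from the Bryant and Hawkes paper): the coarse version below serializes every update behind one lock, while the fine-grained version locks only the bucket being touched, so unrelated updates no longer contend.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

#define NBUCKETS 64

int table[NBUCKETS];

/* Coarse-grained: one lock serializes access to the whole table. */
pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

void update_coarse(int key, int val) {
    pthread_mutex_lock(&amp;amp;big_lock);
    table[key % NBUCKETS] += val;
    pthread_mutex_unlock(&amp;amp;big_lock);
}

/* Fine-grained: one lock per bucket, so unrelated keys do not contend.
   (Each bucket_lock must first be set up with pthread_mutex_init.) */
pthread_mutex_t bucket_lock[NBUCKETS];

void update_fine(int key, int val) {
    int b = key % NBUCKETS;
    pthread_mutex_lock(&amp;amp;bucket_lock[b]);
    table[b] += val;
    pthread_mutex_unlock(&amp;amp;bucket_lock[b]);
}
&lt;/pre&gt;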
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of the CPU time. It had two problems: its coarse-grained spinlock consumed the majority of that time, while the rest went into needlessly computing and recomputing information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and to decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes and behaviors, such as race conditions, and therefore requires careful programming and design choices. The framework used by UMS threading can be divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, thereby creating an affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level, while coroutines execute at the language level. Unlike UMS threads, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished.&lt;br /&gt;
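&lt;br /&gt;
As a rough illustration of user-level switching in the spirit of fibers and coroutines, here is a sketch using the (obsolescent but widely available) POSIX ucontext API; Windows fibers use CreateFiber/SwitchToFiber instead, and glibc&#039;s swapcontext may still enter the kernel to adjust signal masks, so this is illustrative rather than a faithful fiber implementation.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;ucontext.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];       /* the fiber gets its own stack */

static void fiber_fn(void) {
    puts(&amp;quot;fiber: step 1&amp;quot;);
    swapcontext(&amp;amp;fiber_ctx, &amp;amp;main_ctx);   /* cooperative yield */
    puts(&amp;quot;fiber: step 2&amp;quot;);
}   /* on return, control goes to uc_link (main_ctx) */

int main(void) {
    getcontext(&amp;amp;fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &amp;amp;main_ctx;
    makecontext(&amp;amp;fiber_ctx, fiber_fn, 0);

    swapcontext(&amp;amp;main_ctx, &amp;amp;fiber_ctx);   /* run until first yield */
    puts(&amp;quot;main: fiber yielded&amp;quot;);
    swapcontext(&amp;amp;main_ctx, &amp;amp;fiber_ctx);   /* resume to completion */
    return 0;
}
&lt;/pre&gt;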
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles it takes to tear down and re-establish a thread each time. Thread pools often operate on server farms, so thread safety has to be carefully implemented.&lt;br /&gt;
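&lt;br /&gt;
A minimal fixed-size pool might look like the following sketch (illustrative only; a real pool would need shutdown handling and an overflow check on the queue): workers sleep on a condition variable instead of being destroyed, and are handed tasks from a ring buffer.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

#define NWORKERS 4
#define QSIZE    128

typedef void (*task_fn)(void *);

static struct { task_fn fn; void *arg; } queue[QSIZE];
static int head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Workers block here when idle instead of being destroyed. */
static void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&amp;amp;qlock);
        while (count == 0)
            pthread_cond_wait(&amp;amp;nonempty, &amp;amp;qlock);
        task_fn fn = queue[head].fn;
        void *arg = queue[head].arg;
        head = (head + 1) % QSIZE;
        count--;
        pthread_mutex_unlock(&amp;amp;qlock);
        fn(arg);                          /* run the task outside the lock */
    }
    return NULL;
}

void pool_submit(task_fn fn, void *arg) {
    pthread_mutex_lock(&amp;amp;qlock);
    queue[tail].fn = fn;
    queue[tail].arg = arg;
    tail = (tail + 1) % QSIZE;            /* sketch: no overflow check */
    count++;
    pthread_cond_signal(&amp;amp;nonempty);
    pthread_mutex_unlock(&amp;amp;qlock);
}

void pool_start(void) {
    pthread_t tid;
    for (int i = 0; i &amp;lt; NWORKERS; i++)
        pthread_create(&amp;amp;tid, NULL, worker, NULL);
}
&lt;/pre&gt;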
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel because its implementation cost would be too high; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling in order to deal with the multitude of per-thread signal masks. Since a signal is sent to a thread only if that thread has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
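&lt;br /&gt;
A common pattern that exploits per-thread signal masks is to block a signal everywhere and let one dedicated thread collect it with sigwait (a sketch using standard POSIX calls):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;signal.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Dedicated signal thread: the only place SIGUSR1 is handled. */
static void *sig_thread(void *arg) {
    sigset_t *set = arg;
    int sig;
    for (;;) {
        sigwait(set, &amp;amp;sig);         /* sleeps until delivery */
        printf(&amp;quot;got signal %d\n&amp;quot;, sig);
    }
    return NULL;
}

int main(void) {
    static sigset_t set;
    pthread_t tid;
    sigemptyset(&amp;amp;set);
    sigaddset(&amp;amp;set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &amp;amp;set, NULL);  /* new threads inherit the mask */
    pthread_create(&amp;amp;tid, NULL, sig_thread, &amp;amp;set);
    pthread_join(tid, NULL);          /* sketch: runs until killed */
    return 0;
}
&lt;/pre&gt;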
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution in the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
The four types of synchronization, explained (see the semaphore sketch after this list):&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads can hold read access to a protected resource at once, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads. It keeps a count of how many threads may have concurrent access to the data; once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
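&lt;br /&gt;
To tie two of these primitives together, here is a sketch of a counting semaphore built from a mutex and a condition variable (illustrative; POSIX also provides sem_init/sem_wait/sem_post directly):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int count;                 /* how many threads may still enter */
} csem_t;

void csem_init(csem_t *s, int limit) {
    pthread_mutex_init(&amp;amp;s-&amp;gt;lock, NULL);
    pthread_cond_init(&amp;amp;s-&amp;gt;cond, NULL);
    s-&amp;gt;count = limit;
}

void csem_wait(csem_t *s) {    /* block while the limit is reached */
    pthread_mutex_lock(&amp;amp;s-&amp;gt;lock);
    while (s-&amp;gt;count == 0)
        pthread_cond_wait(&amp;amp;s-&amp;gt;cond, &amp;amp;s-&amp;gt;lock);
    s-&amp;gt;count--;
    pthread_mutex_unlock(&amp;amp;s-&amp;gt;lock);
}

void csem_post(csem_t *s) {    /* admit one more thread */
    pthread_mutex_lock(&amp;amp;s-&amp;gt;lock);
    s-&amp;gt;count++;
    pthread_cond_signal(&amp;amp;s-&amp;gt;cond);
    pthread_mutex_unlock(&amp;amp;s-&amp;gt;lock);
}
&lt;/pre&gt;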
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when large numbers of threads are in play. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem that can arise with this creation process, since it front-loads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating the stack memory later is expensive. One way to optimize this creation process for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the thread&#039;s stack to be allocated at startup (when the thread actually begins running) rather than at creation. When the thread enters startup, it can copy its arguments out of its control block and allocate its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done in a serial manner, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
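&lt;br /&gt;
The following sketch shows the idea (the field names and sizes are hypothetical, not taken from the cited paper): arguments are copied into the control block at creation, and the stack is only allocated at startup.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define MAXARGS 64

typedef struct tcb {
    char args[MAXARGS];        /* arguments copied in at creation */
    size_t args_len;
    void (*entry)(void *);
    char *stack;               /* stays NULL until startup */
    struct tcb *next;          /* free-list link, used below */
} tcb_t;

/* Creation is cheap: no stack is allocated yet. */
tcb_t *thread_make(void (*entry)(void *), const void *args, size_t len) {
    tcb_t *t = malloc(sizeof *t);
    t-&amp;gt;entry = entry;
    t-&amp;gt;args_len = len;
    memcpy(t-&amp;gt;args, args, len);    /* sketch: assumes len fits MAXARGS */
    t-&amp;gt;stack = NULL;
    return t;
}

/* Startup: the expensive work is deferred until the thread first runs. */
void thread_begin(tcb_t *t) {
    t-&amp;gt;stack = malloc(64 * 1024);  /* allocate the stack lazily */
    /* ...switch onto t-&amp;gt;stack and call t-&amp;gt;entry(t-&amp;gt;args)... */
}
&lt;/pre&gt;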
	&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list makes allocation and deallocation simple list operations; if they were not stored in a free list, the thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
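&lt;br /&gt;
Recycling through a free list then reduces allocation and deallocation to pointer operations, as in this sketch (it reuses the hypothetical tcb_t above; a real implementation would also need locking or per-CPU lists):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdlib.h&amp;gt;

static tcb_t *free_list;          /* singly linked list of retired TCBs */

void thread_retire(tcb_t *t) {    /* deallocation: an O(1) list push */
    t-&amp;gt;next = free_list;
    free_list = t;
}

tcb_t *thread_reuse(void) {       /* allocation: an O(1) list pop... */
    tcb_t *t = free_list;
    if (t != NULL) {
        free_list = t-&amp;gt;next;
        return t;
    }
    return malloc(sizeof *t);     /* ...with the heap as a fallback */
}
&lt;/pre&gt;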
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority, which is a number ranging from 0 to 31 on Windows; Linux instead orders threads with a red-black tree in the CFS (Completely Fair Scheduler). Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal- (or higher-) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
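&lt;br /&gt;
For example, on POSIX systems a thread&#039;s policy and priority can be set explicitly (a sketch; real-time policies such as SCHED_RR typically require elevated privileges):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sched.h&amp;gt;

/* Ask for round-robin scheduling at the given priority for the
   calling thread; returns 0 on success, an errno value otherwise. */
int set_rr_priority(int prio) {
    struct sched_param sp = { .sched_priority = prio };
    return pthread_setschedparam(pthread_self(), SCHED_RR, &amp;amp;sp);
}
&lt;/pre&gt;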
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3842</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3842"/>
		<updated>2010-10-14T16:14:02Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of a single process, sharing data with them; it can execute either the same code as the other threads or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs that rely on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs build on this idea, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that multiple threads stay on the queues without actually running at the same time; the speed at which they are switched gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that keeps causing the problem is identified, it can be moved to limit it. This bottleneck can appear anywhere expensive operations are called a needless number of times; the problem is not inherent, but the result of bad design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: because reads held the lock, writes to the timer value were blocked, and the kernel wasted CPU time retrying them. The problem was solved by using a lockless read. This bottleneck can appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a finer-grained implementation eats less CPU time than a coarse-grained one. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores; the preferred method on Linux NUMA systems was to limit the BKL&#039;s usage, so the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of those examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of the CPU time. It had two problems: its coarse-grained spinlock consumed the majority of that time, while the rest went into needlessly computing and recomputing information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and to decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes and behaviors, such as race conditions, and therefore requires careful programming and design choices. The framework used by UMS threading can be divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, thereby creating an affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level, while coroutines execute at the language level. Unlike UMS threads, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles it takes to tear down and re-establish a thread each time. Thread pools often operate on server farms, so thread safety has to be carefully implemented.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel because its implementation cost would be too high; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling in order to deal with the multitude of per-thread signal masks. Since a signal is sent to a thread only if that thread has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution in the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
The four types of synchronization, explained:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads can hold read access to a protected resource at once, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads. It keeps a count of how many threads may have concurrent access to the data; once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when large numbers of threads are in play. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem that can arise with this creation process, since it front-loads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating the stack memory later is expensive. One way to optimize this creation process for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the thread&#039;s stack to be allocated at startup (when the thread actually begins running) rather than at creation. When the thread enters startup, it can copy its arguments out of its control block and allocate its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done in a serial manner, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
	&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list makes allocation and deallocation simple list operations; if they were not stored in a free list, the thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority, which is a number ranging from 0 to 31 on Windows; Linux instead orders threads with a red-black tree in the CFS (Completely Fair Scheduler). Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal- (or higher-) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3841</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3841"/>
		<updated>2010-10-14T16:12:11Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of a single process, sharing data with them; it can execute either the same code as the other threads or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs that rely on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs build on this idea, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that multiple threads stay on the queues without actually running at the same time; the speed at which they are switched gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that keeps causing the problem is identified, it can be moved to limit it. This bottleneck can appear anywhere expensive operations are called a needless number of times; the problem is not inherent, but the result of bad design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: because reads held the lock, writes to the timer value were blocked, and the kernel wasted CPU time retrying them. The problem was solved by using a lockless read. This bottleneck can appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a finer-grained implementation eats less CPU time than a coarse-grained one. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores; the preferred method on Linux NUMA systems was to limit the BKL&#039;s usage, so the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of those examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of the CPU time. It had two problems: its coarse-grained spinlock consumed the majority of that time, while the rest went into needlessly computing and recomputing information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and to decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes and behaviors, such as race conditions, and therefore requires careful programming and design choices. The framework used by UMS threading can be divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, thereby creating an affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level, while coroutines execute at the language level. Unlike UMS threads, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles it takes to tear down and re-establish a thread each time. Thread pools often operate on server farms, so thread safety has to be carefully implemented.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel because its implementation cost would be too high; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling in order to deal with the multitude of per-thread signal masks. Since a signal is sent to a thread only if that thread has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution in the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
The four types of synchronization, explained:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads can hold read access to a protected resource at once, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads. It keeps a count of how many threads may have concurrent access to the data; once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when large numbers of threads are in play. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem that can arise with this creation process, since it front-loads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because dynamically allocating the stack memory later is expensive. One way to optimize this creation process for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the thread&#039;s stack to be allocated at startup (when the thread actually begins running) rather than at creation. When the thread enters startup, it can copy its arguments out of its control block and allocate its memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done in a serial manner, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
	&lt;br /&gt;
The deallocation of a thread can also be optimized to increase thread scalability. Storing deallocated stacks and control blocks in a free list makes allocation and deallocation simple list operations; if they were not stored in a free list, the thread overhead would also include finding a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority, which is a number ranging from 0 to 31 on Windows; Linux instead orders threads with a red-black tree in the CFS (Completely Fair Scheduler). Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish performing their tasks. A thread&#039;s context internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal- (or higher-) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3840</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3840"/>
		<updated>2010-10-14T16:11:55Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of a single process, sharing data with them; it can execute either the same code as the other threads or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavyweight, and designs that rely on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs build on this idea, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that multiple threads stay on the queues without actually running at the same time; the speed at which they are switched gives the impression that they execute simultaneously.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first paragraph.&#039;&#039;&#039; and &#039;&#039;&#039;I have added my version to Paul&#039;s and modified it. [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks. Each is an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, misplaced information in the cache can trigger a &amp;quot;cache-coherency operation&amp;quot;, which is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced information that keeps causing the problem is identified, it can be moved to limit it. This bottleneck can appear anywhere expensive operations are called a needless number of times; the problem is not inherent, but the result of bad design.&lt;br /&gt;
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: because reads held the lock, writes to the timer value were blocked, and the kernel wasted CPU time retrying them. The problem was solved by using a lockless read. This bottleneck can appear anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a finer-grained implementation eats less CPU time than a coarse-grained one. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with a finer-grained strategy from a later implementation of Linux. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores; the preferred method on Linux NUMA systems was to limit the BKL&#039;s usage, so the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of those examples are the result of coarse granularity.&lt;br /&gt;
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems.&#039;&#039;&#039; One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of the CPU time. It had two problems: its coarse-grained spinlock consumed the majority of that time, while the rest went into needlessly computing and recomputing information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and to decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows non-intuitive outcomes and behaviors, such as race conditions, and therefore requires careful programming and design choices. The framework used by UMS threading can be divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, thereby creating an affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers exist at the system level, while coroutines execute at the language level. Unlike UMS threads, fibers do not take advantage of multiprocessor machines, but they also require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above are finished.&lt;br /&gt;
&lt;br /&gt;
Thread Pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks to complete, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles it takes to tear down and re-establish a thread each time. Thread pools often operate on server farms, so thread safety has to be carefully implemented.&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not compatible with the Linux kernel because its implementation cost would be too high; this gave birth to the 1:1 model.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling in order to deal with the multitude of per-thread signal masks. Since a signal is sent to a thread only if that thread has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution in the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by degradation of the quality of signal handling in the application. Fortunately, new functionality has since been added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
The four types of synchronization, explained:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads can hold read access to a protected resource at once, but editing its contents requires the exclusive write lock, which is only granted once all read locks have been released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*A counting semaphore grants access to multiple threads. It keeps a count of how many threads may have concurrent access to the data; once the limit is reached, further threads are blocked until the count changes.&lt;br /&gt;
[[vG]]&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block is needed for thread management, as it contains the thread&#039;s state data. Optimizing this data structure can greatly increase performance when large numbers of threads are in play. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation process, since it frontloads the cost. Another problem is that the thread would have to allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block: the stack can then be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than at creation, and at startup the thread copies its arguments back out of its control block into its newly allocated stack. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads under contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
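A hedged sketch of that argument-copying idea (the structure, sizes, and names are illustrative, not taken from the cited paper): arguments live in the control block from creation on, and the stack is only paid for when the thread first runs.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define ARG_MAX_BYTES 64
#define STACK_BYTES   (64 * 1024)

/* Control block: allocated eagerly and cheaply; the stack comes later.
   The next field is used by the free list shown further below. */
struct tcb {
    void (*fn)(void *);
    char  args[ARG_MAX_BYTES];   /* arguments copied in at creation */
    char *stack;                 /* NULL until the thread starts */
    struct tcb *next;
};

/* Creation: no stack allocation yet, so it stays cheap. */
struct tcb *thread_create(void (*fn)(void *), const void *args, size_t len) {
    struct tcb *t = malloc(sizeof *t);
    t-&amp;gt;fn = fn;
    memcpy(t-&amp;gt;args, args, len);  /* stash arguments in the control block */
    t-&amp;gt;stack = NULL;
    return t;
}

/* Startup: runs once a processor is idle; only now pay for the stack. */
void thread_start(struct tcb *t) {
    t-&amp;gt;stack = malloc(STACK_BYTES);
    /* ... switch to the new stack, copy the arguments out of t-&amp;gt;args,
       and call t-&amp;gt;fn ... */
}
&lt;/pre&gt;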
	&lt;br /&gt;
Thread deallocation can also be optimized to increase scalability. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would also include searching for a correctly sized region of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
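Continuing the sketch above (same assumptions; a real version would guard the list with a lock): deallocation pushes the control block, stack still attached, onto a free list, and the next creation pops it in constant time.&lt;br /&gt;
&lt;pre&gt;
static struct tcb *free_list;   /* reuses struct tcb from the sketch above */

/* Deallocation: an O(1) push, keeping the stack attached for reuse. */
void thread_free(struct tcb *t) {
    t-&amp;gt;next = free_list;
    free_list = t;
}

/* Creation fast path: an O(1) pop instead of searching the allocator
   for a block of the right size. */
struct tcb *thread_alloc(void) {
    if (!free_list)
        return NULL;            /* caller falls back to thread_create() */
    struct tcb *t = free_list;
    free_list = t-&amp;gt;next;
    return t;
}
&lt;/pre&gt;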
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: in Windows this is a number ranging from 0 to 31, while in Linux the CFS (Completely Fair Scheduler) orders runnable threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is described by its context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process in which the thread resides. A context switch occurs when the time slice elapses and a thread of equal (or higher) priority becomes available; an efficient context-switch implementation is a key enabler of high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency.&amp;lt;ref&amp;gt;[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx &#039;&#039;Scheduling Priorities&#039;&#039;]&amp;lt;/ref&amp;gt;, Microsoft (23 September 2010) --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
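A hedged example of setting an explicit priority on a POSIX system (SCHED_RR is the round-robin real-time policy and usually requires privileges, hence the error check):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sched.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static void *work(void *arg) { return NULL; }

int main(void) {
    pthread_t t;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };

    pthread_attr_init(&amp;amp;attr);
    /* Round-robin among equal-priority threads, as described above. */
    pthread_attr_setschedpolicy(&amp;amp;attr, SCHED_RR);
    pthread_attr_setschedparam(&amp;amp;attr, &amp;amp;sp);
    /* Use these attributes rather than inheriting the creating thread. */
    pthread_attr_setinheritsched(&amp;amp;attr, PTHREAD_EXPLICIT_SCHED);

    int rc = pthread_create(&amp;amp;t, &amp;amp;attr, work, NULL);
    if (rc != 0)
        fprintf(stderr, &amp;quot;pthread_create failed: %d\n&amp;quot;, rc);
    else
        pthread_join(t, NULL);
    return 0;
}
&lt;/pre&gt;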
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Resources| ]]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3836</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3836"/>
		<updated>2010-10-14T16:06:24Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute the same code as they do or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs built directly on them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User-Mode Scheduling), and allow for very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means that multiple threads stay on the run queues but cannot all run at the same instant; they appear to execute simultaneously because of the speed at which the system switches between them.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
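A small illustration of that sharing, assuming POSIX threads: both threads update the same global variable, while each keeps its own local counter on its own stack.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static int shared;   /* one copy, visible to every thread in the process */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg) {
    int local = 0;   /* per-thread: lives on the stack of this thread */
    for (int i = 0; i &amp;lt; 100000; i++) {
        pthread_mutex_lock(&amp;amp;m);
        shared++;            /* shared address space, so lock around it */
        pthread_mutex_unlock(&amp;amp;m);
        local++;
    }
    printf(&amp;quot;local=%d\n&amp;quot;, local);    /* always 100000 */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&amp;amp;a, NULL, bump, NULL);
    pthread_create(&amp;amp;b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf(&amp;quot;shared=%d\n&amp;quot;, shared);  /* 200000: both threads updated it */
    return 0;
}
&lt;/pre&gt;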
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. &#039;&#039;&#039; and &#039;&#039;&#039;i have added my version to pauls and modified it [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled. When porting Linux to a 64-processor NUMA system, Ray Bryant and John Hawkes documented the following bottlenecks, each an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, some information was placed poorly in memory, so that accessing it triggered &amp;quot;cache-coherency operations&amp;quot; between processors; this is expensive compared with accessing data laid out in the right place. Once the misplaced information that keeps causing the problem is identified, it can be moved to limit the traffic. This bottleneck can appear anywhere an expensive operation is called a needless number of times; the problem is not inherent, but a result of poor design.&lt;br /&gt;
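One common instance of such misplaced information is false sharing; a hedged sketch, assuming 64-byte cache lines:&lt;br /&gt;
&lt;pre&gt;
/* Two counters updated by different threads: if they share a cache
   line, every update triggers a coherency operation on the other CPU. */
struct counters_bad {
    long a;   /* CPU 0 writes this ... */
    long b;   /* ... CPU 1 writes this: same line, so the line bounces */
};

/* The fix is to move the data: pad so each hot field has its own line. */
struct counters_good {
    long a;
    char pad[64 - sizeof(long)];   /* assume 64-byte cache lines */
    long b;
};
&lt;/pre&gt;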
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One example was xtime_lock in Linux: readers holding the lock prevented the timer value from being written, causing the kernel to waste CPU time retrying. The problem was solved with a lockless read. This kind of bottleneck can appear anywhere a thread must keep trying to execute but cannot, wasting CPU cycles.&lt;br /&gt;
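The lockless read is essentially a sequence count: the writer is never blocked by readers, and a reader simply retries if the count changed underneath it. A simplified single-writer sketch (a real implementation also needs memory barriers around the data access):&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;stdatomic.h&amp;gt;

static atomic_uint seq;          /* even = stable, odd = write in progress */
static unsigned long timer_value;

/* Single writer: bump to odd, update, bump back to even. */
void write_timer(unsigned long v) {
    atomic_fetch_add(&amp;amp;seq, 1);
    timer_value = v;
    atomic_fetch_add(&amp;amp;seq, 1);
}

/* Reader: retry until the same even count is seen before and after. */
unsigned long read_timer(void) {
    unsigned s;
    unsigned long v;
    do {
        s = atomic_load(&amp;amp;seq);
        v = timer_value;
    } while ((s &amp;amp; 1) || atomic_load(&amp;amp;seq) != s);
    return v;
}
&lt;/pre&gt;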
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and a coarse-grained implementation eats far more CPU time than a finer-grained one would. One coarse-grained bottleneck was the dcache_lock. It consumed some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with the finer-grained one from a later Linux implementation. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s global kernel synchronization lock. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 processors. The preferred approach on Linux NUMA systems was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both examples are the result of coarse granularity. &lt;br /&gt;
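The general remedy, sketched on an illustrative hash table (not the actual dcache code): replace one big lock over the whole structure with one lock per bucket, so threads touching different buckets no longer serialize.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

#define NBUCKETS 64

struct bucket {
    pthread_mutex_t lock;   /* fine-grained: one lock per bucket */
    int value;
};
static struct bucket table[NBUCKETS];

void table_init(void) {
    for (int i = 0; i &amp;lt; NBUCKETS; i++)
        pthread_mutex_init(&amp;amp;table[i].lock, NULL);
}

/* Only threads that hash to the same bucket contend; a single
   table-wide lock would serialize every caller instead. */
void table_add(unsigned key, int delta) {
    struct bucket *b = &amp;amp;table[key % NBUCKETS];
    pthread_mutex_lock(&amp;amp;b-&amp;gt;lock);
    b-&amp;gt;value += delta;
    pthread_mutex_unlock(&amp;amp;b-&amp;gt;lock);
}
&lt;/pre&gt;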
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems&#039;&#039;&#039; at once. One example is the multiqueue scheduler from Linux 2.4, which altogether ate up 25% of CPU time. It had two problems: its spinlock, which was coarse-grained and consumed the majority of that time, and needless recomputation of cached information, an expensive operation. These problems were fixed by replacing the scheduler (which was itself later replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. The application must implement a scheduler that manages its UMS threads and decides when they run or stop. This is not desirable for systems with moderate performance requirements, because concurrent execution of this sort naturally allows non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading can be divided into smaller abstractions depending on the desired utility; for instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads run around one scheduler. This can turn out to be inefficient if many related threads end up starving other processes. &lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers are provided at the system level while coroutines exist at the language level. Unlike UMS threads, fibers do not exploit multiprocessor machines, but they require less operating system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an active scheduler object contains a single fiber that is scheduled when an asynchronous call returns, blocking lower-priority fibers until all those above have finished. &lt;br /&gt;
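A hedged analogue in POSIX C, using ucontext to play the role of a fiber (Windows exposes a similar idea through its fiber API): switches are cooperative and explicit, and nothing runs until the current context yields.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;ucontext.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static ucontext_t main_ctx, fib_ctx;
static char fib_stack[64 * 1024];

/* The fiber runs until it explicitly yields; nothing preempts it. */
static void fiber_fn(void) {
    puts(&amp;quot;fiber: part 1&amp;quot;);
    swapcontext(&amp;amp;fib_ctx, &amp;amp;main_ctx);   /* yield back to main */
    puts(&amp;quot;fiber: part 2&amp;quot;);
}                                        /* returning resumes uc_link */

int main(void) {
    getcontext(&amp;amp;fib_ctx);
    fib_ctx.uc_stack.ss_sp = fib_stack;
    fib_ctx.uc_stack.ss_size = sizeof fib_stack;
    fib_ctx.uc_link = &amp;amp;main_ctx;        /* continue here when it returns */
    makecontext(&amp;amp;fib_ctx, fiber_fn, 0);

    swapcontext(&amp;amp;main_ctx, &amp;amp;fib_ctx);   /* run fiber part 1 */
    puts(&amp;quot;main: between switches&amp;quot;);
    swapcontext(&amp;amp;main_ctx, &amp;amp;fib_ctx);   /* run fiber part 2 */
    puts(&amp;quot;main: done&amp;quot;);
    return 0;
}
&lt;/pre&gt;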
&lt;br /&gt;
Thread pools consist of queues of threads that stay alive and wait for new tasks to be assigned to them; if there are no new tasks, the threads sleep. This pattern eliminates the overhead of repeatedly creating and destroying threads, which shows up as better system stability and improved performance. Long-lived threads can, for instance, handle many transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-create a thread each time. Thread pools often run on server farms, so thread safety has to be implemented carefully.&lt;br /&gt;
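&lt;br /&gt;
A minimal pthread sketch of the pattern (ours; the names pool_start and pool_submit are hypothetical): workers are created once and then sleep on a condition variable until a task is queued, so no thread is ever torn down and re-created per request.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 typedef struct task {&lt;br /&gt;
     void (*fn)(void *);&lt;br /&gt;
     void *arg;&lt;br /&gt;
     struct task *next;&lt;br /&gt;
 } task_t;&lt;br /&gt;
 &lt;br /&gt;
 static task_t *head, *tail;                 /* FIFO queue of pending tasks */&lt;br /&gt;
 static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;&lt;br /&gt;
 static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;&lt;br /&gt;
 &lt;br /&gt;
 static void *worker(void *unused) {&lt;br /&gt;
     for (;;) {&lt;br /&gt;
         pthread_mutex_lock(&amp;amp;qlock);&lt;br /&gt;
         while (head == NULL)                /* sleep until work arrives */&lt;br /&gt;
             pthread_cond_wait(&amp;amp;qcond, &amp;amp;qlock);&lt;br /&gt;
         task_t *t = head;&lt;br /&gt;
         head = t-&amp;gt;next;&lt;br /&gt;
         if (head == NULL)&lt;br /&gt;
             tail = NULL;&lt;br /&gt;
         pthread_mutex_unlock(&amp;amp;qlock);&lt;br /&gt;
         t-&amp;gt;fn(t-&amp;gt;arg);                      /* run the task outside the lock */&lt;br /&gt;
         free(t);&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void pool_start(int nthreads) {             /* create the workers once, up front */&lt;br /&gt;
     for (int i = 0; i &amp;lt; nthreads; i++) {&lt;br /&gt;
         pthread_t tid;&lt;br /&gt;
         pthread_create(&amp;amp;tid, NULL, worker, NULL);&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void pool_submit(void (*fn)(void *), void *arg) {&lt;br /&gt;
     task_t *t = malloc(sizeof *t);&lt;br /&gt;
     t-&amp;gt;fn = fn;&lt;br /&gt;
     t-&amp;gt;arg = arg;&lt;br /&gt;
     t-&amp;gt;next = NULL;&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;qlock);&lt;br /&gt;
     if (tail) tail-&amp;gt;next = t; else head = t;&lt;br /&gt;
     tail = t;&lt;br /&gt;
     pthread_cond_signal(&amp;amp;qcond);            /* wake one sleeping worker */&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;qlock);&lt;br /&gt;
 }&lt;br /&gt;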
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel&#039;s functions. An M:N implementation would require a far more complicated library, although it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not a good fit for the Linux kernel because of its high implementation cost, and so the 1:1 model was adopted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B) Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling for use with the multitude of per-thread signal masks. Since a signal is delivered to a thread only if it is unblocked there, no unnecessary interruptions through signals occur. The kernel is also in a much better position than a user-level library to judge which thread is the best target for a signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
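&lt;br /&gt;
A common idiom this enables under the 1:1 model (a sketch of standard POSIX usage, not taken from the essay&#039;s sources): block a signal in every thread with pthread_sigmask(), then let one dedicated thread collect it synchronously with sigwait(), so there is always exactly one intended recipient.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;signal.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 static void *sig_thread(void *arg) {&lt;br /&gt;
     sigset_t *set = arg;&lt;br /&gt;
     int sig;&lt;br /&gt;
     for (;;) {&lt;br /&gt;
         sigwait(set, &amp;amp;sig);            /* receive the signal synchronously */&lt;br /&gt;
         printf(&amp;quot;got signal %d\n&amp;quot;, sig);&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     sigset_t set;&lt;br /&gt;
     pthread_t tid;&lt;br /&gt;
 &lt;br /&gt;
     sigemptyset(&amp;amp;set);&lt;br /&gt;
     sigaddset(&amp;amp;set, SIGUSR1);&lt;br /&gt;
     /* Block SIGUSR1 before creating threads; they inherit the mask, so&lt;br /&gt;
        only the sigwait() call above ever consumes the signal. */&lt;br /&gt;
     pthread_sigmask(SIG_BLOCK, &amp;amp;set, NULL);&lt;br /&gt;
     pthread_create(&amp;amp;tid, NULL, sig_thread, &amp;amp;set);&lt;br /&gt;
 &lt;br /&gt;
     pthread_kill(tid, SIGUSR1);        /* aim a signal at that one thread */&lt;br /&gt;
     pthread_join(tid, NULL);           /* never returns in this toy example */&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;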
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C) Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles); the same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. That method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and by the degradation of signal handling in the application. Fortunately, new kernel functionality (in Linux, futexes) was since added to implement all kinds of synchronization.&lt;br /&gt;
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads can hold read access to a protected resource at once, but editing its contents requires the exclusive write lock, which can be granted only after all read locks are released&lt;br /&gt;
*Condition variable synchronization blocks a thread until a given condition becomes true (a mutex and a condition variable are sketched together below)&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count tracks how many threads may access the data concurrently, and once the limit is reached further threads are blocked until the count changes&lt;br /&gt;
[[vG]]&lt;br /&gt;
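&lt;br /&gt;
A short sketch (ours) of the first two primitives working together: the mutex serializes updates to a shared counter, and the condition variable lets the waiting thread sleep, instead of busy waiting, until the counter reaches a threshold.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;&lt;br /&gt;
 static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;&lt;br /&gt;
 static int count;&lt;br /&gt;
 &lt;br /&gt;
 static void *producer(void *unused) {&lt;br /&gt;
     for (int i = 0; i &amp;lt; 10; i++) {&lt;br /&gt;
         pthread_mutex_lock(&amp;amp;lock);     /* mutex: exclusive access to count */&lt;br /&gt;
         count++;&lt;br /&gt;
         pthread_cond_signal(&amp;amp;ready);   /* wake the waiter so it can re-check */&lt;br /&gt;
         pthread_mutex_unlock(&amp;amp;lock);&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     pthread_t tid;&lt;br /&gt;
     pthread_create(&amp;amp;tid, NULL, producer, NULL);&lt;br /&gt;
 &lt;br /&gt;
     pthread_mutex_lock(&amp;amp;lock);&lt;br /&gt;
     while (count &amp;lt; 10)                 /* loop guards against spurious wakeups */&lt;br /&gt;
         pthread_cond_wait(&amp;amp;ready, &amp;amp;lock);&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;lock);&lt;br /&gt;
 &lt;br /&gt;
     printf(&amp;quot;count reached %d\n&amp;quot;, count);&lt;br /&gt;
     pthread_join(tid, NULL);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;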
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D) Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block, where the control block holds the state data needed for thread management. Optimizing this data structure can greatly increase performance when thread counts are large. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this scheme, since it frontloads the cost. Another problem is that the thread must allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block: the stack can then be allocated at the thread&#039;s startup (when the thread actually begins to run) rather than at creation, and at startup the thread copies its arguments back out of its control block. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads under contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
	&lt;br /&gt;
Thread deallocation can also be optimized to increase scalability. Storing deallocated stacks and control blocks on a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would also include searching for a correctly sized block of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
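&lt;br /&gt;
A simplified, single-threaded sketch of that free-list idea (ours, after Anderson et al.; a real implementation would protect the list with a lock or keep one list per processor): freeing pushes a control block onto a linked list, and the common allocation path pops it back off in constant time instead of calling the allocator.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 typedef struct tcb {&lt;br /&gt;
     struct tcb *next;   /* free-list link, reused while the block is idle */&lt;br /&gt;
     void *stack;        /* the stack stays attached to the control block */&lt;br /&gt;
     /* ... registers, state, copied-in arguments ... */&lt;br /&gt;
 } tcb_t;&lt;br /&gt;
 &lt;br /&gt;
 static tcb_t *free_list;  /* a real system would lock this or keep one per CPU */&lt;br /&gt;
 &lt;br /&gt;
 tcb_t *tcb_alloc(void) {&lt;br /&gt;
     if (free_list) {                   /* fast path: pop in constant time */&lt;br /&gt;
         tcb_t *t = free_list;&lt;br /&gt;
         free_list = t-&amp;gt;next;&lt;br /&gt;
         return t;&lt;br /&gt;
     }&lt;br /&gt;
     return calloc(1, sizeof(tcb_t));   /* slow path: really allocate */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void tcb_free(tcb_t *t) {&lt;br /&gt;
     t-&amp;gt;next = free_list;               /* push in constant time */&lt;br /&gt;
     free_list = t;&lt;br /&gt;
 }&lt;br /&gt;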
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E) Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity scheduled according to its scheduling priority: in Windows this is a number ranging from 0 to 31, while Linux&#039;s CFS (Completely Fair Scheduler) orders threads in a red-black tree instead. Threads execute in time slices assigned in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread is composed of a thread context, which breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal- or higher-priority thread becomes available; an efficient context-switch implementation is what permits high scalability. For example, fibers, which are switched entirely in userspace, do not require a system call during a switch, which greatly increases efficiency.[http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx][#2] --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
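&lt;br /&gt;
To make the priority side concrete, here is a hedged POSIX sketch (ours; it uses the fixed-priority SCHED_RR policy rather than CFS, and raising priority typically requires privileges): it queries the legal priority range, then sets the calling thread&#039;s policy and priority.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;sched.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     /* Ask the system for the legal range of the round-robin policy... */&lt;br /&gt;
     int lo = sched_get_priority_min(SCHED_RR);&lt;br /&gt;
     int hi = sched_get_priority_max(SCHED_RR);&lt;br /&gt;
     printf(&amp;quot;SCHED_RR priorities: %d..%d\n&amp;quot;, lo, hi);&lt;br /&gt;
 &lt;br /&gt;
     /* ...then request the middle of that range for this thread. */&lt;br /&gt;
     struct sched_param sp = { .sched_priority = (lo + hi) / 2 };&lt;br /&gt;
     int err = pthread_setschedparam(pthread_self(), SCHED_RR, &amp;amp;sp);&lt;br /&gt;
     if (err != 0)&lt;br /&gt;
         printf(&amp;quot;pthread_setschedparam failed: %d\n&amp;quot;, err);&lt;br /&gt;
 &lt;br /&gt;
     int policy;&lt;br /&gt;
     pthread_getschedparam(pthread_self(), &amp;amp;policy, &amp;amp;sp);&lt;br /&gt;
     printf(&amp;quot;policy=%d priority=%d\n&amp;quot;, policy, sp.sched_priority);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;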
&lt;br /&gt;
== References ==&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=75378 The performance implications of thread management alternatives for shared-memory multiprocessors], ACM (May 1989)&lt;br /&gt;
# [http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx Scheduling Priorities (Windows)], Microsoft (23 September 2010)&lt;br /&gt;
# [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/ Inside the Linux 2.6 Completely Fair Scheduler], IBM (15 December 2009)&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3821</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3821"/>
		<updated>2010-10-14T15:51:14Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work, and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
A process is an instance of a program running on a computer, with its own resources such as an address space, files, and I/O devices. A thread, on the other hand, is an independent task that executes in the same address space as the other threads of its process, sharing data with them; it can execute the same code as they do or different code within the same application, because it has its own state, run-time stack, and execution context. Threads require fewer system resources than concurrent cooperating processes and start much more easily, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs built on them are not very scalable. User threads, by contrast, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User Mode Scheduling), which allow very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than Thread Pools, yet another mechanism that allows high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency here means multiple threads sitting on run queues without actually running at the same time; it gives the impression of simultaneous execution because of the speed at which the threads are switched.&amp;lt;br&amp;gt; [[vG]] &amp;amp;&amp;amp; [[Paul]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. &#039;&#039;&#039; and &#039;&#039;&#039;i have added my version to pauls and modified it [[vG]]&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating the bottlenecks that appear once it is scaled up. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks, each an example of a type of bottleneck that can appear in any program.&lt;br /&gt;
&lt;br /&gt;
One type of bottleneck appears when expensive operations are &#039;&#039;&#039;needlessly called&#039;&#039;&#039;. In Linux, information misplaced in the cache can repeatedly trigger a &amp;quot;cache-coherency operation&amp;quot; that is expensive compared to what would happen if the information were in the &#039;right place&#039;. Once the misplaced data that keeps causing the problem is identified, it can be relocated to limit the damage. This bottleneck can appear anywhere an expensive operation is called a needless number of times (the problem is not inherent, but the result of bad design).&lt;br /&gt;
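&lt;br /&gt;
&#039;&#039;A sketch of the misplaced-information case (our illustration, assuming 64-byte cache lines): two counters sharing one line force needless coherency traffic between cores, and padding them apart removes it:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* False-sharing sketch: without the pad, a and b share a cache line,
   so two cores bumping them bounce that line back and forth (the
   needless "cache-coherency operation").  The pad puts b on its own
   line - the data is now in the right place. */
#include &lt;pthread.h&gt;

struct padded {
    long a;
    char pad[64 - sizeof(long)];   /* assume 64-byte cache lines */
    long b;                        /* now on its own line */
};

static struct padded c;

static void *bump_a(void *arg)
{
    for (long i = 0; i &lt; 100000000; i++) c.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    for (long i = 0; i &lt; 100000000; i++) c.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&amp;t1, NULL, bump_a, NULL);
    pthread_create(&amp;t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;                      /* remove pad to see the slowdown */
}
&lt;/pre&gt;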
&lt;br /&gt;
Another type of bottleneck comes from &#039;&#039;&#039;starvation.&#039;&#039;&#039; One such bottleneck was the xtime_lock in Linux: readers holding the lock prevented writes to the timer value, causing the kernel to waste CPU time retrying. The problem was solved by using a lockless read. This kind of bottleneck appears anywhere a thread must keep trying to execute but cannot, leading to wasted CPU cycles.&lt;br /&gt;
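&lt;br /&gt;
&#039;&#039;The lockless read can be sketched with a sequence counter in the style of the Linux fix (our simplification; a production version also needs memory barriers around the data accesses):&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Sequence-counter ("seqlock"-style) read: readers never block the
   single writer; they retry if the count was odd (write in progress)
   or changed underneath them (a write slipped in). */
#include &lt;stdatomic.h&gt;

static _Atomic unsigned seq;            /* even = stable, odd = writing */
static struct { long sec, nsec; } xtime;

void write_time(long sec, long nsec)    /* single writer assumed */
{
    atomic_fetch_add(&amp;seq, 1);          /* odd: write begins */
    xtime.sec  = sec;
    xtime.nsec = nsec;
    atomic_fetch_add(&amp;seq, 1);          /* even: write done */
}

void read_time(long *sec, long *nsec)   /* lockless: retry instead of wait */
{
    unsigned s;
    do {
        s = atomic_load(&amp;seq);
        *sec  = xtime.sec;
        *nsec = xtime.nsec;
    } while ((s &amp; 1) || atomic_load(&amp;seq) != s);
}
&lt;/pre&gt;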
&lt;br /&gt;
The next type of bottleneck comes from &#039;&#039;&#039;coarse-grained&#039;&#039;&#039; operations. Granularity refers to the execution time of a code segment: the closer a segment is to the cost of an atomic action, the finer its granularity, and both examples here eat far more CPU time than a finer-grained implementation would. One coarse-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also taken in the much more frequently called dnotify_parent() function, which was unacceptable; the dcache_lock strategy was therefore replaced with the finer-grained strategy of a later Linux implementation. Another big coarse-grained bottleneck was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s global kernel synchronization lock: waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred remedy on Linux NUMA systems was to limit the BKL&#039;s usage, so the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both of these examples are the result of coarse granularity. &lt;br /&gt;
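&lt;br /&gt;
&#039;&#039;A granularity sketch of ours: one big lock serializes every bucket of a hash table, while per-bucket locks let unrelated operations proceed in parallel - the same idea as replacing the BKL or dcache_lock with narrower locks:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Coarse vs fine granularity on a hash table.  With the big lock,
   every insert serializes; with one lock per bucket, two threads
   contend only when they hit the same bucket. */
#include &lt;pthread.h&gt;

#define NBUCKETS 64

struct node { int key; struct node *next; };

static struct node *bucket[NBUCKETS];
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;  /* coarse */
static pthread_mutex_t bucket_lock[NBUCKETS];                 /* fine   */

void table_init(void)
{
    for (int i = 0; i &lt; NBUCKETS; i++)
        pthread_mutex_init(&amp;bucket_lock[i], NULL);
}

void insert_coarse(struct node *n)   /* everything funnels through one lock */
{
    int b = (unsigned)n-&gt;key % NBUCKETS;
    pthread_mutex_lock(&amp;big_lock);
    n-&gt;next = bucket[b];
    bucket[b] = n;
    pthread_mutex_unlock(&amp;big_lock);
}

void insert_fine(struct node *n)     /* contention only within a bucket */
{
    int b = (unsigned)n-&gt;key % NBUCKETS;
    pthread_mutex_lock(&amp;bucket_lock[b]);
    n-&gt;next = bucket[b];
    bucket[b] = n;
    pthread_mutex_unlock(&amp;bucket_lock[b]);
}
&lt;/pre&gt;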
&lt;br /&gt;
Bottlenecks can also stem from &#039;&#039;&#039;multiple problems&#039;&#039;&#039; at once. One example is the multiqueue scheduler of Linux 2.4, which altogether ate up 25% of the CPU time. It had two problems: its spinlock, which was coarse-grained, consumed the majority of that time, while the rest went into needlessly computing and recomputing information in the cache, an expensive operation. These problems were fixed by replacing the scheduler (which was in turn replaced by the more efficient O(1) scheduler).&lt;br /&gt;
&lt;br /&gt;
--[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and up for debate &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought designs in which threads are cheap and processes are expensive. UMS, which reflects such a design, is the recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should run or stop. This approach is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows for non-intuitive outcomes such as race conditions, which demand careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, a UMS scheduler can be assigned to each logical processor, creating affinity so that related threads cluster around one scheduler; this can turn out to be inefficient if many related threads end up starving other processes. &lt;br /&gt;
&lt;br /&gt;
Fibers embrace essentially the same abstraction as coroutines; the distinction is that fibers are a system-level facility while coroutines live at the language level. Unlike UMS threads, fibers cannot exploit multiprocessor machines, but they require less operating-system support. The Symbian operating system presents an example of fiber usage in its Active Scheduler: an Active Scheduler object contains a single fiber that is scheduled when an asynchronous call returns, and it blocks lower-priority fibers until all those above them are finished. &lt;br /&gt;
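&lt;br /&gt;
&#039;&#039;The user-mode switching that both UMS and fibers rely on can be sketched with ucontext(3) (our illustration, not the Windows fiber API): the scheduling decision happens entirely in user space:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Fibers in miniature with ucontext(3): the switch is decided and
   performed by user code; the kernel never schedules the fiber. */
#include &lt;stdio.h&gt;
#include &lt;ucontext.h&gt;

static ucontext_t main_ctx, fib_ctx;
static char fib_stack[64 * 1024];

static void fiber_body(void)
{
    printf("fiber: first slice\n");
    swapcontext(&amp;fib_ctx, &amp;main_ctx);   /* cooperative yield */
    printf("fiber: second slice\n");
}                                        /* returns via uc_link */

int main(void)
{
    getcontext(&amp;fib_ctx);
    fib_ctx.uc_stack.ss_sp   = fib_stack;
    fib_ctx.uc_stack.ss_size = sizeof fib_stack;
    fib_ctx.uc_link = &amp;main_ctx;         /* where to go on return */
    makecontext(&amp;fib_ctx, fiber_body, 0);

    swapcontext(&amp;main_ctx, &amp;fib_ctx);    /* run first slice  */
    printf("main: fiber yielded\n");
    swapcontext(&amp;main_ctx, &amp;fib_ctx);    /* run to completion */
    return 0;
}
&lt;/pre&gt;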
&lt;br /&gt;
Thread pools consist of queues of threads that stay alive and await new tasks to be assigned to them; if there are no new tasks, they sleep or wait. This pattern eliminates the overhead of creating and destroying threads, which shows up as better system stability and improved performance. The long-lived threads can, for instance, handle multiple transaction requests arriving over socket connections from other machines in a short time frame, while avoiding the millions of cycles needed to tear down and re-establish a thread for each one. Thread pools often operate on server farms, so thread safety has to be carefully implemented.&lt;br /&gt;
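&lt;br /&gt;
&#039;&#039;A minimal thread-pool sketch (ours, not from the cited sources): long-lived workers pull tasks from a shared queue, so the create/destroy cost is paid once rather than per task:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Thread-pool sketch: workers sleep on a condition variable when the
   queue is empty and are woken as tasks arrive. */
#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;

struct task { void (*fn)(void *); void *arg; struct task *next; };

static struct task *head;                  /* FIFO of pending tasks */
static struct task **tail = &amp;head;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qwait = PTHREAD_COND_INITIALIZER;

void submit(void (*fn)(void *), void *arg)
{
    struct task *t = malloc(sizeof *t);
    t-&gt;fn = fn; t-&gt;arg = arg; t-&gt;next = NULL;
    pthread_mutex_lock(&amp;qlock);
    *tail = t; tail = &amp;t-&gt;next;            /* append to queue */
    pthread_cond_signal(&amp;qwait);           /* wake one sleeping worker */
    pthread_mutex_unlock(&amp;qlock);
}

static void *worker(void *unused)
{
    for (;;) {
        pthread_mutex_lock(&amp;qlock);
        while (head == NULL)               /* no work: sleep, do not spin */
            pthread_cond_wait(&amp;qwait, &amp;qlock);
        struct task *t = head;
        head = t-&gt;next;
        if (head == NULL) tail = &amp;head;
        pthread_mutex_unlock(&amp;qlock);
        t-&gt;fn(t-&gt;arg);                     /* run the task off the lock */
        free(t);
    }
    return NULL;
}

void pool_start(int nworkers)              /* threads outlive any one task */
{
    for (int i = 0; i &lt; nworkers; i++) {
        pthread_t tid;
        pthread_create(&amp;tid, NULL, worker, NULL);
    }
}
&lt;/pre&gt;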
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel&#039;s functions. Although the M:N model would require a more complicated library, it would offer advantages in areas such as signal handling. The general consensus, however, was that the M:N design was incompatible with the Linux kernel given the high cost of implementing it. This gave birth to the 1:1 model.&lt;br /&gt;
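&lt;br /&gt;
&#039;&#039;On Linux, 1:1 means each pthread is one kernel task; a stripped-down illustration of ours using clone(2) (the real NPTL pthread_create passes more flags and sets up TLS):&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* 1:1 in miniature: one clone(2) call per thread, sharing the
   parent address space, files and signal handlers. */
#define _GNU_SOURCE
#include &lt;sched.h&gt;
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/wait.h&gt;

static int thread_fn(void *arg)
{
    printf("kernel-visible thread, arg=%s\n", (char *)arg);
    return 0;
}

int main(void)
{
    size_t stksz = 64 * 1024;
    char *stack = malloc(stksz);

    /* the stack grows down on x86, so pass the TOP of the block */
    int tid = clone(thread_fn, stack + stksz,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
                    | SIGCHLD, "hello");
    if (tid &lt; 0) { perror("clone"); return 1; }

    waitpid(tid, NULL, 0);          /* reap it like a child */
    free(stack);
    return 0;
}
&lt;/pre&gt;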
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling to cope with the multitude of signal masks. Since a signal is delivered to a thread only if it is unblocked there, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread should receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
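&lt;br /&gt;
&#039;&#039;The classic pattern this enables (a sketch of ours): block the signal in every thread&#039;s mask and let one dedicated thread sigwait() for it, so no other thread is ever interrupted:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* One thread owns SIGUSR1: the mask set before pthread_create is
   inherited, so the signal stays pending until sigwait collects it. */
#include &lt;pthread.h&gt;
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;

static void *signal_thread(void *arg)
{
    sigset_t set;
    int sig;
    sigemptyset(&amp;set);
    sigaddset(&amp;set, SIGUSR1);
    sigwait(&amp;set, &amp;sig);                 /* only this thread receives it */
    printf("got signal %d\n", sig);
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;

    sigemptyset(&amp;set);
    sigaddset(&amp;set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &amp;set, NULL);  /* inherited by new threads */

    pthread_create(&amp;tid, NULL, signal_thread, NULL);
    pthread_kill(tid, SIGUSR1);              /* or kill(2) from outside */
    pthread_join(tid, NULL);
    return 0;
}
&lt;/pre&gt;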
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution in the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and the degradation of signal-handling quality in the application. Fortunately, new functionality (futexes) was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
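&lt;br /&gt;
&#039;&#039;A minimal wait/wake pair built on that kernel functionality (our sketch; error handling omitted). A thread enters the kernel only when the user-space fast path fails:&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Futex sketch: FUTEX_WAIT sleeps only if the flag is still 0, so
   the common uncontended path never makes a system call. */
#define _GNU_SOURCE
#include &lt;linux/futex.h&gt;
#include &lt;stdatomic.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

static _Atomic int flag;        /* 0 = not ready, 1 = ready */

void wait_for_flag(void)
{
    while (atomic_load(&amp;flag) == 0)        /* fast path: no syscall */
        syscall(SYS_futex, &amp;flag, FUTEX_WAIT, 0, NULL, NULL, 0);
}

void set_flag(void)
{
    atomic_store(&amp;flag, 1);
    syscall(SYS_futex, &amp;flag, FUTEX_WAKE, 1, NULL, NULL, 0);
}
&lt;/pre&gt;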
&lt;br /&gt;
Explaining the four types of synchronization:&lt;br /&gt;
&lt;br /&gt;
*A mutex lock admits only one thread at a time, giving it exclusive access to a certain part of the code&lt;br /&gt;
*With read/write synchronization, many threads may hold read access to a protected resource, but editing its contents requires the exclusive write lock, which is granted only once all read locks have been released&lt;br /&gt;
*Condition-variable synchronization blocks a thread until a condition becomes true&lt;br /&gt;
*Counting semaphores grant access to multiple threads: a count keeps track of how many threads may have concurrent access to the data, and once the limit is reached further threads are blocked until a slot is released.&lt;br /&gt;
[[vG]]&lt;br /&gt;
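&lt;br /&gt;
&#039;&#039;The four primitives in their POSIX forms, locking discipline only (a sketch of ours, not from the sources above):&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* One-screen tour of the four primitives just listed. */
#include &lt;pthread.h&gt;
#include &lt;semaphore.h&gt;

static pthread_mutex_t  m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static pthread_cond_t   cv = PTHREAD_COND_INITIALIZER;
static sem_t slots;                  /* sem_init(&amp;slots, 0, N) first */
static int ready;

void mutex_demo(void)                /* one thread in the section */
{
    pthread_mutex_lock(&amp;m);
    /* ... critical section ... */
    pthread_mutex_unlock(&amp;m);
}

void rwlock_demo(void)
{
    pthread_rwlock_rdlock(&amp;rw);      /* many readers may hold this */
    pthread_rwlock_unlock(&amp;rw);
    pthread_rwlock_wrlock(&amp;rw);      /* granted once readers drain */
    pthread_rwlock_unlock(&amp;rw);
}

void condvar_wait(void)              /* block until condition true */
{
    pthread_mutex_lock(&amp;m);
    while (!ready)
        pthread_cond_wait(&amp;cv, &amp;m);
    pthread_mutex_unlock(&amp;m);
}

void semaphore_demo(void)            /* at most N threads at once */
{
    sem_wait(&amp;slots);                /* blocks when count hits 0 */
    /* ... access shared data ... */
    sem_post(&amp;slots);
}
&lt;/pre&gt;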
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management, from creation through maintenance to deallocation, is an important design choice when attempting to create a large number of threads in a single process. A thread&#039;s data structure is made up of a program counter, a stack, and a control block; the control block holds the state data needed for thread management. Optimizing this data structure can greatly increase performance when the number of threads is large. &lt;br /&gt;
	&lt;br /&gt;
A thread can be created before the process actually requires it to run, waiting until an idle processor becomes available to run it. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem with this creation process, since it front-loads the cost. Another problem is that the thread must normally allocate the memory for its stack at creation, because dynamically allocating stack memory later is expensive. One way to optimize creation for large numbers of threads is to copy the thread&#039;s arguments into its control block; this allows the stack to be allocated at the thread&#039;s startup (when the thread actually starts being used) rather than when it is created, and on startup the thread copies its arguments out of its control block into its newly allocated memory. Thread creation is governed by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention); if thread memory management is done serially, these two factors combine to impose a maximum rate of thread creation.&lt;br /&gt;
	&lt;br /&gt;
Thread deallocation can also be optimized to increase the scalability of threads. Storing deallocated stacks and control blocks in a free list turns allocation and deallocation into simple list operations; without a free list, thread overhead would include searching for a correctly sized block of free memory to hold the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
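&lt;br /&gt;
&#039;&#039;A free-list sketch of the deallocation idea (ours; the control-block layout here is hypothetical):&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Retiring a thread pushes its stack + control block onto a list;
   creating one pops them back - an O(1) list operation instead of a
   fresh allocation and a search for free memory. */
#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;

#define STACK_SZ (64 * 1024)

struct tcb {                      /* control block: state of a thread */
    void *stack;
    void (*entry)(void *);
    void *arg;                    /* args copied in at creation */
    struct tcb *next;
};

static struct tcb *free_list;
static pthread_mutex_t fl_lock = PTHREAD_MUTEX_INITIALIZER;

struct tcb *tcb_alloc(void)
{
    pthread_mutex_lock(&amp;fl_lock);
    struct tcb *t = free_list;
    if (t) free_list = t-&gt;next;        /* reuse: pop, O(1) */
    pthread_mutex_unlock(&amp;fl_lock);
    if (!t) {                          /* cold path: real allocation */
        t = calloc(1, sizeof *t);
        t-&gt;stack = malloc(STACK_SZ);
    }
    return t;
}

void tcb_free(struct tcb *t)           /* retire: push, O(1) */
{
    pthread_mutex_lock(&amp;fl_lock);
    t-&gt;next = free_list;
    free_list = t;
    pthread_mutex_unlock(&amp;fl_lock);
}
&lt;/pre&gt;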
&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority, which in Windows is a number ranging from 0 to 31 and in Linux is derived from the thread&#039;s position in the red-black tree used by the CFS (Completely Fair Scheduler). Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish performing their tasks. A thread is composed of its thread context, which internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs when the time slice elapses and an equal- (or higher-) priority thread becomes available, and an efficient context-switch implementation is what permits high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
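&lt;br /&gt;
&#039;&#039;Priorities in practice (a sketch using the POSIX API; the numeric range is queried rather than assumed, and the real-time class needs privilege on most Linux systems):&#039;&#039;&lt;br /&gt;
&lt;pre&gt;
/* Setting a round-robin priority on a thread.  Windows uses 0-31;
   Linux SCHED_RR is typically 1-99, so ask rather than hard-code. */
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;

static void *busy(void *arg)
{
    for (;;)
        sched_yield();               /* stand-in for real work */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct sched_param sp;
    int policy;

    pthread_create(&amp;tid, NULL, busy, NULL);

    sp.sched_priority = sched_get_priority_min(SCHED_RR) + 1;
    if (pthread_setschedparam(tid, SCHED_RR, &amp;sp) != 0)
        fprintf(stderr, "setschedparam failed (need privilege?)\n");

    pthread_getschedparam(tid, &amp;policy, &amp;sp);
    printf("policy=%d priority=%d\n", policy, sp.sched_priority);
    return 0;                        /* process exit ends the thread */
}
&lt;/pre&gt;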
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3627</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3627"/>
		<updated>2010-10-14T04:42:54Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Log ==&lt;br /&gt;
&#039;&#039;&#039;Suggestion:&#039;&#039;&#039; Let us maintain our edits here instead of littering the main page with our names. Also, please do not edit without writing to the log, so that we know who has done what and when.&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
Moved around some info for clarity. Everyone should post their interpretation of the question in the simplest possible English so we&#039;re on the same page (as someone, maybe me, seems to have the wrong idea about what we&#039;re trying to talk about). &lt;br /&gt;
More moving for clarity. Added an essay outline at the bottom (feel free to change it).&lt;br /&gt;
Filled in the outline somewhat and added questions to it for everyone to think on.--[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
First Draft for essay. Please modify and add on. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scheduling Priorities and rewrote some areas to provide a better paragraph structure. --[[User:Spanke|Shane]] 15:25, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Added to the memory management section. --[[User:Hirving|Hirving]] 21:42, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scalable Threads Problems. Also did a little re-arrangement. --[[User:Gautam|Gautam]] 01:03, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Answered Essay Questions in Discussion. --[[User:Spanke|Shane]] 01:25, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;Add your future activities here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== The Question ==&lt;br /&gt;
&#039;&#039;&#039;Original:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rannath:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The question seems to be about the number and scalability of threads, not the gross mechanics.&lt;br /&gt;
&lt;br /&gt;
To be more clear: we can limit ourselves from thread implementations in general to thread scalability... ignore the stuff that&#039;s required for all threads, unless it&#039;s required for many threads. (I didn&#039;t find any implementations that required hardware.)&lt;br /&gt;
&lt;br /&gt;
I would also argue that since OSs have to run on varied hardware, one cannot guarantee that unique/rare hardware features will be there. While we can talk about hardware, we should limit it to a mention at most. Or we could mention prospective hardware that could help out but is not yet standard; it depends on whether we want to describe things &amp;quot;as they are&amp;quot; or &amp;quot;as they might be&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Utility of such massively scalable thread implementations&amp;quot;: I took this as asking what functionality (of single threads) one has to give up to make threads scalable.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gautam:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
I think the hardware is as relevant as the software. Not all things can be done in software and hardware support is an important factor in most of the solutions to many problems that OS face. My take.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Henry:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Since the question is about the system as a whole, I think the answer should include both software and hardware support for large numbers of threads. The question revolves around how a system can handle millions of threads and what the major factors are that allow it to do so. Also, the last part of the question seems to ask what this number of threads allows a process to do.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Shane:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
In response to the above idea on the last part of the question, I would argue that it enables fast execution, because work stalled by a cache miss in one thread would be picked up by the other threads so long as there were enough resources. Also, the use of more threads would help keep the cache synchronized (through sharing) so that it would not miss. Of course, this only applies if the threads are assigned to the same task; you cannot sync threads running different applications, it just wouldn&#039;t make sense. The only issue with this idea is that the software must support this number of threads.&lt;br /&gt;
&lt;br /&gt;
== Group 7 ==&lt;br /&gt;
&lt;br /&gt;
Let us start out by listing our names and email IDs (preferred). &lt;br /&gt;
&lt;br /&gt;
Gautam Akiwate         &amp;lt;gautam.akiwate@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patrick Young(rannath) &amp;lt;rannath@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
vG Vivek &amp;lt;support.tamiltreasure@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shane Panke &amp;lt;shanepanke@msn.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Henry Irving &amp;lt;sens.henry@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Paul Raubic &amp;lt;paul_raubic@hotmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Guidelines ==&lt;br /&gt;
&lt;br /&gt;
Raw info should include some indication of where you got it, for citation.&lt;br /&gt;
&lt;br /&gt;
Claim your info so we don&#039;t need to dig for who got what when we need clarification.&lt;br /&gt;
&lt;br /&gt;
Feel free to provide info for, or edit, someone else&#039;s info; just keep their signature so we can discuss changes.&lt;br /&gt;
&lt;br /&gt;
Sign changes (once), preferably without time stamps. Ex: --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
== Facts We have ==&lt;br /&gt;
Start by placing the info here so we can sort through it. I&#039;m going to go into full research/essay writing mode on Sunday if there isn&#039;t enough here.&lt;br /&gt;
&lt;br /&gt;
So far we have:&lt;br /&gt;
Three design choices I&#039;ve seen:&lt;br /&gt;
# Smallest possible footprint per-thread (being extremely light weight) - from everywhere&lt;br /&gt;
# least number (none if at all possible) of context switches per-thread - &#039;&#039;5&#039;&#039;&lt;br /&gt;
# use of a &amp;quot;thread pool&amp;quot; - &#039;&#039;3&#039;&#039;&lt;br /&gt;
The idea is to reduce processor time and storage needed per-thread so you can have more in the same amount of space. --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
Multi-threading is a term used to describe:&lt;br /&gt;
&lt;br /&gt;
* A facility provided by the operating system that enables an application to create threads of execution within a process&lt;br /&gt;
* Applications whose architecture takes advantage of the multi-threading provided by the operating system &lt;br /&gt;
[[vG]]&lt;br /&gt;
----&lt;br /&gt;
These are all related ideas.&lt;br /&gt;
&lt;br /&gt;
Ok, since we are discussing design choices maybe we could also elaborate on the two major types of threads. Here, I already wrote a few lines, source can be found in citation section: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Fibers (user-mode threads) provide very quick and efficient switching because there is no need for a system call and the kernel is oblivious to a switch - this allows for millions of user-mode threads. ISSUES: a blocking system call disables all other fibers.&lt;br /&gt;
On the other hand, managing threads through the kernel requires a context switch (between user and kernel mode) on creation and removal of a thread, so programs with a prodigious number of threads would suffer huge performance hits.--[[User:Praubic|Praubic]] 18:05, 10 October 2010 (UTC)&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
User-mode scheduling (UMS) is a light-weight mechanism that applications can use to schedule their own threads. The ability to switch between threads in user mode makes UMS more efficient than thread pools for short-duration work items that require few system calls. [[Paul]]&lt;br /&gt;
&lt;br /&gt;
One implementation of UMS is a combination of N:N and N:M, where the N:N relationship exposes N virtual processors to user space so the user can handle scheduling on their own. &#039;&#039;5&#039;&#039; -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
I would scrap the first two below, at most mention them...&lt;br /&gt;
&lt;br /&gt;
#time-division multiplexing&lt;br /&gt;
#threads vs processes&lt;br /&gt;
#I/O Scheduling -[[vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Splitting this off because I don&#039;t think it&#039;s technically part of the answer&amp;lt;br&amp;gt;&lt;br /&gt;
Multithreading generally occurs by time-division multiplexing: the processor switches between different threads so fast that the user perceives them as running at the same time. [[User:vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
Things that we &#039;&#039;&#039;need&#039;&#039;&#039; to cover in the essay:--[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is a &#039;&#039;&#039;need&#039;&#039;&#039; section; the section below is not &#039;&#039;&#039;needed&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Type of threading (1:1 1:N M:N)&lt;br /&gt;
   2. Signal handling - we might be able to leave this out as it seems some &amp;quot;light weight&amp;quot; threads use no signals&lt;br /&gt;
   3. Synchronisation&lt;br /&gt;
   4. Memory Handling&lt;br /&gt;
   5. Scheduling Priorities (context switching and how it affects the CPU threading process)[[Paul]]&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Things we might want also to cover in the essay (non-essentials here): --[[User:Rannath|Rannath]] 04:43, 10 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Brief History of threading&lt;br /&gt;
   2. examples of attempts at getting absurd numbers of threads (failures)&lt;br /&gt;
   3. other types of threading, including heavy weight and processes&lt;br /&gt;
   4. Examples of systems that require many threads such as mainframe servers or banking client processing.--[[User:Praubic|Praubic]] 17:34, 11 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Here is an example of a design (the topic asks for key design choices; here is one):&lt;br /&gt;
&lt;br /&gt;
Capriccio is a specific design for scalable user-level threads. It is distinct from most designs in being independent of both event-based mechanisms and kernel thread models. It is a very good choice for internet servers, and the implementation can easily support 100,000 threads. It is characterized by high scalability, efficient stack management, and scheduling based on resource usage; however, its performance is not comparable to event-based systems.--[[User:Praubic|Praubic]] 13:32, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(B)Kernel &lt;br /&gt;
   1. Program Thread manipulation through system calls --[[User:Hirving|Hirving]] 20:05, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(C)Hardware --[[User:Hirving|Hirving]] 19:55, 7 October 2010 (UTC)&lt;br /&gt;
   1. Simultaneous Multithreading&lt;br /&gt;
   2. Multi-core processors&lt;br /&gt;
&lt;br /&gt;
== Essay Outline ==&lt;br /&gt;
&lt;br /&gt;
#Thesis is an answer to the question so... that&#039;s the first step, or the last step, we can always present our info and make our thesis match the info.&lt;br /&gt;
#List all questions and points we have about the topic&lt;br /&gt;
&lt;br /&gt;
Questions:&lt;br /&gt;
# What makes threads non-scalable? List the problems&lt;br /&gt;
# What utility do some scalable implementations lack? Why?&lt;br /&gt;
# Just how scalable does a full utility implementation get?&lt;br /&gt;
&lt;br /&gt;
Answers:&lt;br /&gt;
# Memory Usage, Context Switching. Consider using a thread pool.&lt;br /&gt;
# Signals, portability(maybe) both add overhead which would slow down threads&lt;br /&gt;
# If using thread pools, the scalability is then limited to the number of threads in the pool&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Intro (fill in info)&lt;br /&gt;
# Thesis&lt;br /&gt;
# main topics &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Body (made of many main points)&lt;br /&gt;
&lt;br /&gt;
Main Point 1 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- efficient thread creation/destruction is more scalable&amp;lt;br&amp;gt;&lt;br /&gt;
-- NPTL&#039;s improvements over LinuxThreads- primarily due to lower overhead of creation/destruction &#039;&#039;1&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 2 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- UMS &amp;amp; user-space threads are more scalable - maybe&amp;lt;br&amp;gt;&lt;br /&gt;
-- context switches are costly &#039;&#039;From class&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- blocking locks have lower latency when twinned with a user space scheduler &#039;&#039;8&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Ok for point 2 -&amp;gt; I posted a draft on the essay page, but I&#039;m not certain whether I should talk about fibers, since they also function in user space but they&#039;re not UMS. --[[User:Praubic|Praubic]] 00:18, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Main Point 3&amp;lt;br&amp;gt;&lt;br /&gt;
- Certain bottlenecks appear in scaled implementations; removing these improves scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;False cache-line sharing&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- xtime lock to a lockless lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 3.5&amp;lt;br&amp;gt;&lt;br /&gt;
Fine-grained over coarse-grained&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;Big Kernel Lock&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- dcache_lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Link the Main points to the thesis&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Conclusion&lt;br /&gt;
# restate info&lt;br /&gt;
# affirmation of thesis&lt;br /&gt;
&lt;br /&gt;
Here is the first paragraph that I attempted. Please feel free to change or even delete it from here. &lt;br /&gt;
&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and start much more easily, so there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads, and there are a few designs that incorporate them, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources; however, the ability to switch in user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability.&lt;br /&gt;
--[[User:Praubic|Praubic]] 19:04, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
we can add this for intro paragraph:&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process?&lt;br /&gt;
&lt;br /&gt;
It is possible for systems to support millions of threads or more within a single process because the system can switch execution resources between threads, creating concurrent execution. Concurrency is when multiple threads stay on the queues for switching, incapable of running at the same time, but appear to run simultaneously because of the speed at which they switch. [[vG]] You stated that it is possible, but you did not state how, or rather did not make it clear. The below should be a better interpretation. --[[User:Spanke|Shane]] &lt;br /&gt;
&lt;br /&gt;
Systems can support millions of threads within a single process by switching execution resources between threads, creating concurrent execution. Concurrency means multiple threads stay queued and cannot run at the same time, but they give the impression of executing simultaneously due to the speed at which they switch.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
I suggest that we start filling out the main points of the essay. We can discuss the intricacies as we go along. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
# Short history of threads in Linux and new implementation of them. [http://www.drdobbs.com/open-source/184406204;jsessionid=3MRSO5YMO1QVRQE1GHRSKHWATMY32JVN NPTL: The New Implementation of Threads for Linux ] [[User:Gautam|Gautam]] 22:18, 5 October 2010 (UTC)&lt;br /&gt;
# This paper discusses the design choices [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.6590&amp;amp;rep=rep1&amp;amp;type=pdf Native POSIX Threads] [[User:Gautam|Gautam]] 22:11, 5 October 2010 (UTC)&lt;br /&gt;
# lightweight threads vs kernel threads [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf PicoThreads: Lightweight Threads in Java] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# [http://eigenclass.org/hiki/lightweight-threads-with-lwt Eigenclass: Comparing lightweight threads] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# A lightweight thread implementation for Unix [http://www.usenix.org/publications/library/proceedings/sa92/stein.pdf Implementing lightweight threads] --[[User:Rannath|Rannath]] 00:49, 6 October 2010 (UTC) [[User:Gbint|Gbint]] 19:50, 5 October 2010 (UTC)&lt;br /&gt;
#Not in this group, but I thought that this paper was excellent: [http://www.sandia.gov/~rcmurph/doc/qt_paper.pdf Qthreads: An API for Programming with Millions of Lightweight Threads]&lt;br /&gt;
# Difference between single and multi threading [http://wiki.answers.com/Q/Single_threaded_Process_and_Multi-threaded_Process] [[vG]]&lt;br /&gt;
# [http://hdl.handle.net/1853/6804 Implementation of Scalable Blocking Locks using an Adaptative Thread Scheduler] --[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&lt;br /&gt;
# Research Group working on Simultaneous Multithreading [http://www.cs.washington.edu/research/smt/ Simultaneous Multithreading] --[[User:Hirving|Hirving]] 19:58, 7 October 2010 (UTC)&lt;br /&gt;
# This site provides in-depth info about threads, threads-pooling, scheduling: http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx [[Paul]]&lt;br /&gt;
# Here is another site that outlines THREAD designs and techniques: http://people.csail.mit.edu/rinard/osnotes/h2.html [[Paul]]&lt;br /&gt;
# [http://www.cosc.brocku.ca/Offerings/4P13/slides/threads.ppt Interesting presentation: really worth checking out]  [[Paul]]&lt;br /&gt;
# KERNEL vs USERMODE http://www.wordiq.com/definition/Thread_(computer_science)--[[User:Praubic|Praubic]] 18:06, 10 October 2010 (UTC)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83 Scalability in linux]&lt;br /&gt;
# [http://hillside.net/plop/2007/papers/PLoP2007_Ahluwalia.pdf This has something to do with our question...]&lt;br /&gt;
# [http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx Scheduling Priorities (Windows)], Microsoft (23 September 2010) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.novell.com/coolsolutions/feature/14878.html Linux Scheduling Priorities Explained], Novell (11 October 2005) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/ Inside the Linux 2.6 Completely Fair Scheduler], IBM (15 December 2009) --[[User:Spanke|Shane]]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3624</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3624"/>
		<updated>2010-10-14T04:41:21Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Essay Outline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Log ==&lt;br /&gt;
&#039;&#039;&#039;Suggestion:&#039;&#039;&#039; Let us maintain our edits here instead of littering the main page with our names. Also, please do not edit without writing to the log, so that we know who has done what and when.&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
Moved around some info for clarity. Everyone should post their interpretation of the question in the simplest possible English so we&#039;re on the same page (as someone, maybe me, seems to have the wrong idea about what we&#039;re trying to talk about). &lt;br /&gt;
More moving for clarity. Added an essay outline at the bottom (feel free to change it).&lt;br /&gt;
Filled in the outline somewhat and added questions to it for everyone to think on.--[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
First Draft for essay. Please modify and add on. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scheduling Priorities and rewrote some areas to provide a better paragraph structure. --[[User:Spanke|Shane]] 15:25, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Added to the memory management section. --[[User:Hirving|Hirving]] 21:42, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scalable Threads Problems. Also did a little re-arrangement. --[[User:Gautam|Gautam]] 01:03, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;Add your future activities here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== The Question ==&lt;br /&gt;
&#039;&#039;&#039;Original:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rannath:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The question seems to be about the number and scalability of threads, not the gross mechanics.&lt;br /&gt;
&lt;br /&gt;
To be more clear: we can limit ourselves from thread implementations in general to thread scalability... ignore the stuff that&#039;s required for all threads, unless it&#039;s required for many threads. (I didn&#039;t find any implementations that required hardware.)&lt;br /&gt;
&lt;br /&gt;
I would also argue that since OSs have to run on varied hardware, one cannot guarantee that unique/rare hardware features will be there. While we can talk about hardware, we should limit it to a mention at most. Or we could mention prospective hardware that could help out but is not yet standard; it depends on whether we want to describe things &amp;quot;as they are&amp;quot; or &amp;quot;as they might be&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Utility of such massively scalable thread implementations&amp;quot;: I took this as asking what functionality (of single threads) one has to give up to make threads scalable.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gautam:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
I think the hardware is as relevant as the software. Not all things can be done in software, and hardware support is an important factor in most of the solutions to the many problems that OSes face. My take.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Henry:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Since the question is about the system as a whole, I think the answer should include both software and hardware support for large numbers of threads. The question revolves around how a system can handle millions of threads and what the major factors are that allow the system to do it. Also, the last part of the question seems to ask what this number of threads allows a process to do.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Shane:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
In response to the above&#039;s idea on the last part of the question, I would argue that it would enable fast execution, because the work of any thread that takes a cache miss would be picked up by the other threads, so long as there were enough resources. Also, the use of more threads would help keep the cache synchronized (through sharing) so that it would not miss. Of course, this only applies if they are assigned to the same task; you cannot sync threads running different applications, it just wouldn&#039;t make sense. The only issue with this idea is that the software must support this number of threads.&lt;br /&gt;
&lt;br /&gt;
== Group 7 ==&lt;br /&gt;
&lt;br /&gt;
Let us start by listing our names and email addresses (preferred).&lt;br /&gt;
&lt;br /&gt;
Gautam Akiwate         &amp;lt;gautam.akiwate@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patrick Young(rannath) &amp;lt;rannath@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
vG Vivek &amp;lt;support.tamiltreasure@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shane Panke &amp;lt;shanepanke@msn.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Henry Irving &amp;lt;sens.henry@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Paul Raubic &amp;lt;paul_raubic@hotmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Guidelines ==&lt;br /&gt;
&lt;br /&gt;
Raw info should include some indication of where you got it, for citation purposes.&lt;br /&gt;
&lt;br /&gt;
Claim your info so we don&#039;t need to dig for who got what when we need clarification.&lt;br /&gt;
&lt;br /&gt;
Feel free to add to or edit someone else&#039;s info; just keep their signature so we can discuss changes.&lt;br /&gt;
&lt;br /&gt;
Sign changes (once), preferably without timestamps. Ex: --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log section so that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
== Facts We have ==&lt;br /&gt;
Start by placing the info here so we can sort through it. I&#039;m going to go into full research/essay-writing mode on Sunday if there isn&#039;t enough here.&lt;br /&gt;
&lt;br /&gt;
So far we have:&lt;br /&gt;
Three design choices I&#039;ve seen:&lt;br /&gt;
# Smallest possible footprint per thread (being extremely lightweight) - from everywhere&lt;br /&gt;
# Least number (none, if at all possible) of context switches per thread - &#039;&#039;5&#039;&#039;&lt;br /&gt;
# Use of a &amp;quot;thread pool&amp;quot; - &#039;&#039;3&#039;&#039;&lt;br /&gt;
The idea is to reduce the processor time and storage needed per thread so you can have more in the same amount of space (see the stack-size sketch below). --[[User:Rannath|Rannath]]&lt;br /&gt;
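&lt;br /&gt;
To make point 1 concrete, here is a minimal sketch (my own illustration, not from the sources) of shrinking the per-thread footprint with POSIX threads by requesting a small stack. The 64 KiB figure and the worker name are assumptions for the example; real minimums are platform-dependent (PTHREAD_STACK_MIN).&lt;br /&gt;
&lt;pre&gt;
/* Sketch: smaller per-thread stacks mean more threads fit in the same address space. */
#include &lt;pthread.h&gt;

#define STACK_SIZE (64 * 1024)   /* 64 KiB instead of the common 8 MiB default */

static void *worker(void *arg) { return arg; }

int main(void) {
    pthread_attr_t attr;
    pthread_attr_init(&amp;attr);
    pthread_attr_setstacksize(&amp;attr, STACK_SIZE);

    pthread_t t;
    if (pthread_create(&amp;t, &amp;attr, worker, NULL) == 0)
        pthread_join(t, NULL);
    pthread_attr_destroy(&amp;attr);
    return 0;
}
&lt;/pre&gt;
With a default 8 MiB stack, a 32-bit address space is exhausted after a few hundred threads; small stacks are what make counts in the hundreds of thousands even conceivable.&lt;br /&gt;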
&lt;br /&gt;
Multi-threading is a term used to describe:&lt;br /&gt;
&lt;br /&gt;
* A facility provided by the operating system that enables an application to create threads of execution within a process&lt;br /&gt;
* Applications whose architecture takes advantage of the multi-threading provided by the operating system &lt;br /&gt;
[[vG]]&lt;br /&gt;
----&lt;br /&gt;
These are all related ideas.&lt;br /&gt;
&lt;br /&gt;
Ok, since we are discussing design choices, maybe we could also elaborate on the two major types of threads. Here, I already wrote a few lines; the source can be found in the citation section: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Fibers (user-mode threads) provide very quick and efficient switching because there is no need for a system call and the kernel is oblivious to the switch - this allows for millions of user-mode threads. ISSUES: a blocking system call stalls all other fibers.&lt;br /&gt;
On the other hand, managing threads through the kernel requires a context switch (between user and kernel mode) on creation and removal of a thread, so programs with a prodigious number of threads would suffer huge performance hits.--[[User:Praubic|Praubic]] 18:05, 10 October 2010 (UTC)&#039;&#039;&lt;br /&gt;
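&lt;br /&gt;
To make the fiber idea concrete, here is a minimal sketch (illustrative only, not from the cited source) of two user-mode contexts switching with the POSIX ucontext API; the kernel scheduler is never involved in the switch. The fiber_main name and the 64 KiB stack are assumptions.&lt;br /&gt;
&lt;pre&gt;
/* Sketch: cooperative user-mode switching - the scheduler never sees the fiber. */
#include &lt;ucontext.h&gt;
#include &lt;stdio.h&gt;

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];

static void fiber_main(void) {
    printf(&amp;quot;fiber: running in user mode\n&amp;quot;);
    swapcontext(&amp;fiber_ctx, &amp;main_ctx);   /* yield back without kernel scheduling */
    printf(&amp;quot;fiber: resumed\n&amp;quot;);
}

int main(void) {
    getcontext(&amp;fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &amp;main_ctx;        /* where to go when the fiber returns */
    makecontext(&amp;fiber_ctx, fiber_main, 0);

    swapcontext(&amp;main_ctx, &amp;fiber_ctx);   /* enter the fiber */
    swapcontext(&amp;main_ctx, &amp;fiber_ctx);   /* resume it after its yield */
    printf(&amp;quot;main: fiber finished\n&amp;quot;);
    return 0;
}
&lt;/pre&gt;
Note the issue Praubic raises: if fiber_main made a blocking read(), the whole process - every fiber - would stall, because the kernel only sees one thread.&lt;br /&gt;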
&lt;br /&gt;
&lt;br /&gt;
User-mode scheduling (UMS) is a lightweight mechanism that applications can use to schedule their own threads. The ability to switch between threads in user mode makes UMS more efficient than thread pools for short-duration work items that require few system calls. [[Paul]]&lt;br /&gt;
&lt;br /&gt;
One implementation of UMS is a combination of N:N and N:M, where the N:N relationship exposes N virtual processors to user space so the user can handle scheduling on their own. &#039;&#039;5&#039;&#039; -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
I would scrap the first two below, at most mention them...&lt;br /&gt;
&lt;br /&gt;
#time-division multiplexing&lt;br /&gt;
#threads vs processes&lt;br /&gt;
#I/O Scheduling -[[vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Splitting this off because I don&#039;t think it&#039;s technically part of the answer&amp;lt;br&amp;gt;&lt;br /&gt;
Multithreading generally occurs by time-division multiplexing. It makes it possible for the processor to switch between different threads, but the switching happens so fast that the user perceives the threads as running at the same time. [[User:vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
Things that we &#039;&#039;&#039;need&#039;&#039;&#039; to cover in the essay:--[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is a &#039;&#039;&#039;need&#039;&#039;&#039; section; the section below is not &#039;&#039;&#039;needed&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Type of threading (1:1, 1:N, M:N)&lt;br /&gt;
   2. Signal handling - we might be able to leave this out, as it seems some &amp;quot;lightweight&amp;quot; threads use no signals&lt;br /&gt;
   3. Synchronisation&lt;br /&gt;
   4. Memory Handling&lt;br /&gt;
   5. Scheduling Priorities (context switching and how it affects the CPU threading process)[[Paul]]&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Things we might want also to cover in the essay (non-essentials here): --[[User:Rannath|Rannath]] 04:43, 10 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Brief History of threading&lt;br /&gt;
   2. Examples of attempts at getting absurd numbers of threads (failures)&lt;br /&gt;
   3. Other types of threading, including heavyweight threads and processes&lt;br /&gt;
   4. Examples of systems that require many threads, such as mainframe servers or banking client processing.--[[User:Praubic|Praubic]] 17:34, 11 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Here is an example of a design (the topic asks for key design choices; here is one):&lt;br /&gt;
&lt;br /&gt;
Capriccio is a specific design for scalable user-level threads. It is distinct from most designs in being independent of event-based mechanisms as well as kernel thread models. It is a very good choice for Internet servers, and the implementation could easily support 100,000 threads. It is characterized by high scalability, efficient stack management and scheduling based on resource usage; however, the performance is not comparable to event-based systems.--[[User:Praubic|Praubic]] 13:32, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(B)Kernel &lt;br /&gt;
   1. Program Thread manipulation through system calls --[[User:Hirving|Hirving]] 20:05, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(C)Hardware --[[User:Hirving|Hirving]] 19:55, 7 October 2010 (UTC)&lt;br /&gt;
   1. Simultaneous Multithreading&lt;br /&gt;
   2. Multi-core processors&lt;br /&gt;
&lt;br /&gt;
== Essay Outline ==&lt;br /&gt;
&lt;br /&gt;
#Thesis is an answer to the question, so... that&#039;s the first step, or the last step; we can always present our info and make our thesis match the info.&lt;br /&gt;
#List all questions and points we have about the topic&lt;br /&gt;
&lt;br /&gt;
Questions:&lt;br /&gt;
# What makes threads non-scalable? List the problems&lt;br /&gt;
# What utility do some scalable implementations lack? Why?&lt;br /&gt;
# Just how scalable does a full utility implementation get?&lt;br /&gt;
&lt;br /&gt;
Answers:&lt;br /&gt;
# Memory usage and context switching. Consider using a thread pool (see the sketch below).&lt;br /&gt;
# Signals and portability (maybe); both add overhead, which would slow down threads&lt;br /&gt;
# If using thread pools, the scalability is then limited by the number of threads in the pool&lt;br /&gt;
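&lt;br /&gt;
Since thread pools come up in answers 1 and 3, here is a minimal pthreads sketch of the idea (my own, not from the sources): a fixed set of workers drains a shared task queue, so the number of tasks can grow without creating a thread per task. All names (pool_t, task_t, submit) are invented for the example.&lt;br /&gt;
&lt;pre&gt;
/* Sketch: fixed worker pool plus shared task queue (the thread pool design choice). */
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

typedef struct task { void (*fn)(int); int arg; struct task *next; } task_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    task_t *head, *tail;
    int shutdown;
} pool_t;

static void *worker(void *p) {
    pool_t *pool = p;
    for (;;) {
        pthread_mutex_lock(&amp;pool-&gt;lock);
        while (!pool-&gt;head &amp;&amp; !pool-&gt;shutdown)
            pthread_cond_wait(&amp;pool-&gt;ready, &amp;pool-&gt;lock);
        if (!pool-&gt;head) { pthread_mutex_unlock(&amp;pool-&gt;lock); return NULL; }
        task_t *t = pool-&gt;head;
        pool-&gt;head = t-&gt;next;
        if (!pool-&gt;head) pool-&gt;tail = NULL;
        pthread_mutex_unlock(&amp;pool-&gt;lock);
        t-&gt;fn(t-&gt;arg);          /* run the task outside the lock */
        free(t);
    }
}

static void submit(pool_t *pool, void (*fn)(int), int arg) {
    task_t *t = malloc(sizeof *t);
    t-&gt;fn = fn; t-&gt;arg = arg; t-&gt;next = NULL;
    pthread_mutex_lock(&amp;pool-&gt;lock);
    if (pool-&gt;tail) pool-&gt;tail-&gt;next = t; else pool-&gt;head = t;
    pool-&gt;tail = t;
    pthread_cond_signal(&amp;pool-&gt;ready);
    pthread_mutex_unlock(&amp;pool-&gt;lock);
}

static void say(int i) { printf(&amp;quot;task %d\n&amp;quot;, i); }

int main(void) {
    pool_t pool = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, NULL, 0 };
    pthread_t tid[4];
    for (int i = 0; i &lt; 4; i++) pthread_create(&amp;tid[i], NULL, worker, &amp;pool);
    for (int i = 0; i &lt; 100; i++) submit(&amp;pool, say, i);   /* many tasks, four threads */

    pthread_mutex_lock(&amp;pool.lock);
    pool.shutdown = 1;
    pthread_cond_broadcast(&amp;pool.ready);
    pthread_mutex_unlock(&amp;pool.lock);
    for (int i = 0; i &lt; 4; i++) pthread_join(tid[i], NULL);
    return 0;
}
&lt;/pre&gt;
This also shows the limitation in answer 3: throughput is bounded by the four workers no matter how many tasks are queued.&lt;br /&gt;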
----&lt;br /&gt;
&lt;br /&gt;
Intro (fill in info)&lt;br /&gt;
# Thesis&lt;br /&gt;
# main topics &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Body (made of many main points)&lt;br /&gt;
&lt;br /&gt;
Main Point 1 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- efficient thread creation/destruction is more scalable&amp;lt;br&amp;gt;&lt;br /&gt;
-- NPTL&#039;s improvements over LinuxThreads are primarily due to lower overhead of creation/destruction &#039;&#039;1&#039;&#039;&lt;br /&gt;
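&lt;br /&gt;
One way we could check the creation/destruction overhead ourselves is a micro-benchmark like this sketch (mine, not from source 1): it times N create/join pairs with clock_gettime. N and the empty thread body are arbitrary choices, so treat the output as a rough figure only.&lt;br /&gt;
&lt;pre&gt;
/* Sketch: rough per-thread cost of pthread create+join (build with -lpthread). */
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;

#define N 10000

static void *nop(void *arg) { return arg; }

int main(void) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &amp;a);
    for (int i = 0; i &lt; N; i++) {
        pthread_t t;
        pthread_create(&amp;t, NULL, nop, NULL);
        pthread_join(t, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &amp;b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf(&amp;quot;%.0f ns per create+join\n&amp;quot;, ns / N);
    return 0;
}
&lt;/pre&gt;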
&lt;br /&gt;
Main Point 2 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- UMS &amp;amp; user-space threads are more scalable - maybe&amp;lt;br&amp;gt;&lt;br /&gt;
-- context switches are costly &#039;&#039;From class&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- blocking locks have lower latency when twinned with a user space scheduler &#039;&#039;8&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Ok for point 2 -&amp;gt; I posted a draft on the essay page, but I&#039;m not certain whether I should talk about fibers, since they also function in user space but they&#039;re not UMS. --[[User:Praubic|Praubic]] 00:18, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Main Point 3&amp;lt;br&amp;gt;&lt;br /&gt;
- Certain bottlenecks appear in scaled implementations; removing these improves scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;False cache-line sharing&amp;quot; &#039;&#039;14&#039;&#039; (see the demo sketch below)&amp;lt;br&amp;gt;&lt;br /&gt;
-- xtime lock to a lockless lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
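&lt;br /&gt;
For anyone who has not seen false cache-line sharing before, this sketch (my own, not from source 14) is the standard demonstration: two threads hammer adjacent counters, first packed into one cache line, then padded apart. On most multicore machines the padded run is noticeably faster; the 64-byte line size is an assumption.&lt;br /&gt;
&lt;pre&gt;
/* Sketch: false sharing - time each run externally, e.g. with time(1). */
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

#define ITERS 100000000L

struct { volatile long a, b; } packed;                             /* same cache line */
struct { volatile long a; char pad[64]; volatile long b; } padded; /* separate lines */

static void *bump(void *p) {
    volatile long *c = p;
    for (long i = 0; i &lt; ITERS; i++) (*c)++;
    return NULL;
}

static void run(volatile long *x, volatile long *y, const char *label) {
    pthread_t t1, t2;
    pthread_create(&amp;t1, NULL, bump, (void *)x);
    pthread_create(&amp;t2, NULL, bump, (void *)y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf(&amp;quot;%s done\n&amp;quot;, label);
}

int main(void) {
    run(&amp;packed.a, &amp;packed.b, &amp;quot;packed (false sharing)&amp;quot;);
    run(&amp;padded.a, &amp;padded.b, &amp;quot;padded (no false sharing)&amp;quot;);
    return 0;
}
&lt;/pre&gt;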
&lt;br /&gt;
Main Point 3.5&amp;lt;br&amp;gt;&lt;br /&gt;
Fine-grained over coarse-grained locking&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;Big Kernel Lock&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- dcache_lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Link the Main points to the thesis&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Conclusion&lt;br /&gt;
# restate info&lt;br /&gt;
# affirmation of thesis&lt;br /&gt;
&lt;br /&gt;
Here is the first paragraph that I attempted. Please feel free to change or even delete it from here. &lt;br /&gt;
&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process, sharing data with appropriate synchronization. Threads require fewer system resources than concurrent cooperating processes and are much easier to start, which is why millions of them can exist in a single process. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads, and a few designs incorporate them - mainly Fibers and UMS (User-Mode Scheduling) - which allow for very high scalability. UMS threads have their own context and resources; however, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&lt;br /&gt;
--[[User:Praubic|Praubic]] 19:04, 12 October 2010 (UTC)&lt;br /&gt;
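&lt;br /&gt;
To ground the &amp;quot;same address space&amp;quot; claim in the paragraph above, a tiny illustrative sketch: two pthreads read and update the same global variable, which two separate processes could not do without extra machinery such as shared memory.&lt;br /&gt;
&lt;pre&gt;
/* Sketch: threads share one address space - both workers touch the same global. */
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

static long shared = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *add(void *arg) {
    for (int i = 0; i &lt; 1000; i++) {
        pthread_mutex_lock(&amp;lock);     /* the synchronization part of data sharing */
        shared++;
        pthread_mutex_unlock(&amp;lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&amp;t1, NULL, add, NULL);
    pthread_create(&amp;t2, NULL, add, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf(&amp;quot;shared = %ld\n&amp;quot;, shared);   /* 2000: one memory, two threads */
    return 0;
}
&lt;/pre&gt;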
&lt;br /&gt;
We can add this to the intro paragraph:&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process?&lt;br /&gt;
&lt;br /&gt;
It is possible for systems to support millions of threads or more within a single process because the system can switch execution resources between threads, producing concurrent execution. Concurrency is when multiple threads stay on the queues for switching, unable to run at the same time, but the system makes it look like they are running at the same time due to the speed at which they switch. [[vG]] You stated that it is possible, but you did not state how, or rather did not make it clear. The below should be a better interpretation. --[[User:Spanke|Shane]] &lt;br /&gt;
&lt;br /&gt;
Systems can support millions of threads within a single process by switching execution resources between the threads, creating concurrent execution. Concurrency is the result of multiple threads waiting on the queues without the system being able to run them all at the same time; it gives the impression that they are executing simultaneously because of the speed at which they switch.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
I suggest that we start filling out the main points of the essay. We can discuss the intricacies as we go along. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
# Short history of threads in Linux and new implementation of them. [http://www.drdobbs.com/open-source/184406204;jsessionid=3MRSO5YMO1QVRQE1GHRSKHWATMY32JVN NPTL: The New Implementation of Threads for Linux ] [[User:Gautam|Gautam]] 22:18, 5 October 2010 (UTC)&lt;br /&gt;
# This paper discusses the design choices [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.6590&amp;amp;rep=rep1&amp;amp;type=pdf Native POSIX Threads] [[User:Gautam|Gautam]] 22:11, 5 October 2010 (UTC)&lt;br /&gt;
# lightweight threads vs kernel threads [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf PicoThreads: Lightweight Threads in Java] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# [http://eigenclass.org/hiki/lightweight-threads-with-lwt Eigenclass: Comparing lightweight threads] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# A lightweight thread implementation for Unix [http://www.usenix.org/publications/library/proceedings/sa92/stein.pdf Implementing light weight threads] --[[User:Rannath|Rannath]] 00:49, 6 October 2010 (UTC) [[User:Gbint|Gbint]] 19:50, 5 October 2010 (UTC)&lt;br /&gt;
#Not in this group, but I thought that this paper was excellent: [http://www.sandia.gov/~rcmurph/doc/qt_paper.pdf Qthreads: An API for Programming with Millions of Lightweight Threads]&lt;br /&gt;
# Difference between single and multi threading [http://wiki.answers.com/Q/Single_threaded_Process_and_Multi-threaded_Process] [[vG]]&lt;br /&gt;
# [http://hdl.handle.net/1853/6804 Implementation of Scalable Blocking Locks using an Adaptive Thread Scheduler] --[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&lt;br /&gt;
# Research Group working on Simultaneous Multithreading [http://www.cs.washington.edu/research/smt/ Simultaneous Multithreading] --[[User:Hirving|Hirving]] 19:58, 7 October 2010 (UTC)&lt;br /&gt;
# This site provides in-depth info about threads, threads-pooling, scheduling: http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx [[Paul]]&lt;br /&gt;
# Here is another site that outlines THREAD designs and techniques: http://people.csail.mit.edu/rinard/osnotes/h2.html [[Paul]]&lt;br /&gt;
# [http://www.cosc.brocku.ca/Offerings/4P13/slides/threads.ppt Interesting presentation: really worth checking out]  [[Paul]]&lt;br /&gt;
# KERNEL vs USERMODE http://www.wordiq.com/definition/Thread_(computer_science)--[[User:Praubic|Praubic]] 18:06, 10 October 2010 (UTC)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83 Scalability in Linux]&lt;br /&gt;
# [http://hillside.net/plop/2007/papers/PLoP2007_Ahluwalia.pdf This has something to do with our question...]&lt;br /&gt;
# [http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx Scheduling Priorities (Windows)], Microsoft (23 September 2010) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.novell.com/coolsolutions/feature/14878.html Linux Scheduling Priorities Explained], Novell (11 October 2005) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/ Inside the Linux 2.6 Completely Fair Scheduler], IBM (15 December 2009) --[[User:Spanke|Shane]]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3604</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3604"/>
		<updated>2010-10-14T04:26:11Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Log */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Log ==&lt;br /&gt;
&#039;&#039;&#039;Suggestion:&#039;&#039;&#039; Let us maintain our edits here instead of littering the main page with our names. Also, please do not edit without writing to the log, so that we know who has done what and when.&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
Moved around some info for clarity. Everyone should post your interpretation of the question in the simplest possible English so we&#039;re on the same page (as someone, maybe me, seems to have the wrong idea about what we&#039;re trying to talk about). &lt;br /&gt;
More moving for clarity. Added an essay outline at the bottom (feel free to change).&lt;br /&gt;
Filled in the outline somewhat and added questions to the outline for everyone to think on. --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
First Draft for essay. Please modify and add on. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scheduling Priorities and rewrote some areas to provide a better paragraph structure. --[[User:Spanke|Shane]] 15:25, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Added to the memory management section. --[[User:Hirving|Hirving]] 21:42, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Edited Scalable Threads Problems. Also did a little re-arrangement. --[[User:Gautam|Gautam]] 01:03, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;Add your future activities here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== The Question ==&lt;br /&gt;
&#039;&#039;&#039;Original:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rannath:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The question seems to be about the number and scalability of threads, not the gross mechanics.&lt;br /&gt;
&lt;br /&gt;
To be more clear: we can limit ourselves from the thread implementations to the thread scalability... ignore the stuff that&#039;s required for all threads, unless it&#039;s required for many threads. (I didn&#039;t find any implementations that required hardware.)&lt;br /&gt;
&lt;br /&gt;
I would also argue that since OSs have to run on many kinds of hardware, one cannot guarantee that unique/rare hardware bits will be there. While we can talk about hardware, we should limit it to a mention at most. OR we could mention prospective hardware that could help out but is not yet standard. It depends on whether we want to do &amp;quot;as it is&amp;quot; or &amp;quot;as it might be&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Utility of such massively scalable thread implementations&amp;quot; - I took this as: what functionality (of individual threads) does one have to give up to make threads scalable?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gautam:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
I think the hardware is as relevant as the software. Not all things can be done in software, and hardware support is an important factor in most of the solutions to the many problems that OSes face. My take.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Henry:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Since the question is about the system as a whole, I think the answer should include both software and hardware support for large amounts of threads. The question revolves around how a system can handle millions of threads and what the major factors are that allow the system to do it. Also, the last part of the question seems to ask what this number of threads allows a process to do.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Shane:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
In response to the above&#039;s idea on the last part of the question, I would argue that it would enable fast execution, because work from any thread that hits a cache miss would be picked up by the other threads, so long as there were enough resources. Also, the use of more threads would help synchronize the cache (through sharing) so that it would not miss. Of course, this only applies if they are assigned to the same task; you cannot sync threads running different applications - it just wouldn&#039;t make sense. The only issue with this idea is that the software must support this number.&lt;br /&gt;
&lt;br /&gt;
== Group 7 ==&lt;br /&gt;
&lt;br /&gt;
Let us start out by listing our names and email IDs (preferred). &lt;br /&gt;
&lt;br /&gt;
Gautam Akiwate         &amp;lt;gautam.akiwate@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patrick Young(rannath) &amp;lt;rannath@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
vG Vivek &amp;lt;support.tamiltreasure@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shane Panke &amp;lt;shanepanke@msn.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Henry Irving &amp;lt;sens.henry@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Paul Raubic &amp;lt;paul_raubic@hotmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Guidelines ==&lt;br /&gt;
&lt;br /&gt;
Raw info should have some indication of where you got it for citation.&lt;br /&gt;
&lt;br /&gt;
Claim your info so we don&#039;t need to dig for who got what when we need clarification.&lt;br /&gt;
&lt;br /&gt;
Feel free to provide info for or edit someone else&#039;s info, just keep their signature so we can discuss changes.&lt;br /&gt;
&lt;br /&gt;
Sign changes (once), preferably without time stamps. Ex: --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
== Facts We have ==&lt;br /&gt;
Start by placing the info here so we can sort through it. I&#039;m going to go into full research/essay writing mode on Sunday if there isn&#039;t enough here.&lt;br /&gt;
&lt;br /&gt;
So far we have:&lt;br /&gt;
Three design choices I&#039;ve seen:&lt;br /&gt;
# Smallest possible footprint per thread (being extremely lightweight) - from everywhere&lt;br /&gt;
# the least number (none, if at all possible) of context switches per thread - &#039;&#039;5&#039;&#039;&lt;br /&gt;
# use of a &amp;quot;thread pool&amp;quot; - &#039;&#039;3&#039;&#039; (see the sketch below)&lt;br /&gt;
The idea is to reduce the processor time and storage needed per thread so you can have more in the same amount of space. --[[User:Rannath|Rannath]]&lt;br /&gt;
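&lt;br /&gt;
Since design choice 3 mentions a thread pool, here is a minimal sketch of the idea, assuming POSIX threads (WORKERS and TASKS are arbitrary illustrative sizes): a small, fixed set of worker threads drains a shared queue of tasks, so the per-task cost is a queue operation instead of a thread creation and destruction.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Minimal fixed-size thread pool: WORKERS threads service TASKS&lt;br /&gt;
    queued work items, so task count can exceed thread count. */&lt;br /&gt;
 #define WORKERS 4&lt;br /&gt;
 #define TASKS   16&lt;br /&gt;
 &lt;br /&gt;
 static int queue[TASKS];&lt;br /&gt;
 static int head = 0, tail = 0, done = 0;&lt;br /&gt;
 static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;&lt;br /&gt;
 static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;&lt;br /&gt;
 &lt;br /&gt;
 static void *worker(void *arg) {&lt;br /&gt;
     (void)arg;&lt;br /&gt;
     for (;;) {&lt;br /&gt;
         pthread_mutex_lock(&amp;amp;lock);&lt;br /&gt;
         while (head == tail &amp;amp;&amp;amp; !done)&lt;br /&gt;
             pthread_cond_wait(&amp;amp;more, &amp;amp;lock);&lt;br /&gt;
         if (head == tail &amp;amp;&amp;amp; done) {&lt;br /&gt;
             pthread_mutex_unlock(&amp;amp;lock);&lt;br /&gt;
             return NULL;&lt;br /&gt;
         }&lt;br /&gt;
         int task = queue[head++];&lt;br /&gt;
         pthread_mutex_unlock(&amp;amp;lock);&lt;br /&gt;
         printf(&amp;quot;ran task %d\n&amp;quot;, task); /* stand-in for real work */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     pthread_t t[WORKERS];&lt;br /&gt;
     for (int i = 0; i &amp;lt; WORKERS; i++)&lt;br /&gt;
         pthread_create(&amp;amp;t[i], NULL, worker, NULL);&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;lock);&lt;br /&gt;
     for (int i = 0; i &amp;lt; TASKS; i++)&lt;br /&gt;
         queue[tail++] = i; /* enqueue all tasks */&lt;br /&gt;
     done = 1;&lt;br /&gt;
     pthread_cond_broadcast(&amp;amp;more);&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;lock);&lt;br /&gt;
     for (int i = 0; i &amp;lt; WORKERS; i++)&lt;br /&gt;
         pthread_join(t[i], NULL);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;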
&lt;br /&gt;
Multi-threading is a term used to describe:&lt;br /&gt;
&lt;br /&gt;
* A facility provided by the operating system that enables an application to create threads of execution within a process&lt;br /&gt;
* Applications whose architecture takes advantage of the multi-threading provided by the operating system &lt;br /&gt;
[[vG]]&lt;br /&gt;
----&lt;br /&gt;
These are all related ideas.&lt;br /&gt;
&lt;br /&gt;
Ok, since we are discussing design choices, maybe we could also elaborate on the two major types of threads. Here, I already wrote a few lines; the source can be found in the citation section: &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Fibers (user-mode threads) provide very quick and efficient switching, because there is no need for a system call and the kernel is oblivious to the switch - this allows for millions of user-mode threads. ISSUES: a blocking system call disables all other fibers.&lt;br /&gt;
On the other hand, managing threads through the kernel requires a context switch (between user and kernel mode) on creation and removal of a thread; therefore programs with a prodigious number of threads would suffer huge performance hits.--[[User:Praubic|Praubic]] 18:05, 10 October 2010 (UTC)&#039;&#039;&lt;br /&gt;
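&lt;br /&gt;
To make the fiber idea concrete, here is a sketch using the POSIX ucontext API (one caveat: glibc&#039;s swapcontext does adjust the signal mask with a system call, so real fiber libraries hand-roll an even lighter switch, but the structure is the same): two contexts cooperatively yield to each other inside one kernel thread, and the kernel never schedules them individually.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;ucontext.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Two contexts inside one kernel thread; the kernel is oblivious&lt;br /&gt;
    to the switches between them. */&lt;br /&gt;
 static ucontext_t main_ctx, fib_ctx;&lt;br /&gt;
 static char fib_stack[64 * 1024];&lt;br /&gt;
 &lt;br /&gt;
 static void fiber(void) {&lt;br /&gt;
     for (int i = 0; i &amp;lt; 3; i++) {&lt;br /&gt;
         printf(&amp;quot;fiber step %d\n&amp;quot;, i);&lt;br /&gt;
         swapcontext(&amp;amp;fib_ctx, &amp;amp;main_ctx); /* yield to main */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     getcontext(&amp;amp;fib_ctx);&lt;br /&gt;
     fib_ctx.uc_stack.ss_sp = fib_stack;&lt;br /&gt;
     fib_ctx.uc_stack.ss_size = sizeof fib_stack;&lt;br /&gt;
     fib_ctx.uc_link = &amp;amp;main_ctx; /* where control goes if the fiber returns */&lt;br /&gt;
     makecontext(&amp;amp;fib_ctx, fiber, 0);&lt;br /&gt;
     for (int i = 0; i &amp;lt; 3; i++)&lt;br /&gt;
         swapcontext(&amp;amp;main_ctx, &amp;amp;fib_ctx); /* resume the fiber */&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;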
&lt;br /&gt;
&lt;br /&gt;
User-mode scheduling (UMS) is a light-weight mechanism that applications can use to schedule their own threads. The ability to switch between threads in user mode makes UMS more efficient than thread pools for short-duration work items that require few system calls. [[Paul]]&lt;br /&gt;
&lt;br /&gt;
One implementation of UMS is a combination of N:N and N:M, where the N:N relationship reveals N false processors to user space so the user can deal with scheduling on their own. &#039;&#039;5&#039;&#039; -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
I would scrap the first two below, at most mention them...&lt;br /&gt;
&lt;br /&gt;
#time-division multiplexing&lt;br /&gt;
#threads vs processes&lt;br /&gt;
#I/O Scheduling -[[vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Splitting this off because I don&#039;t think it&#039;s technically part of the answer&amp;lt;br&amp;gt;&lt;br /&gt;
Multithreading generally occurs by time-division multiplexing. This makes it possible for the processor to switch between different threads, but the switching happens so fast that the user perceives the threads as running at the same time. [[User:vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
Things that we &#039;&#039;&#039;need&#039;&#039;&#039; to cover in the essay:--[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is a &#039;&#039;&#039;need&#039;&#039;&#039; section; 4 below is not &#039;&#039;&#039;needed&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Type of threading (1:1 1:N M:N)&lt;br /&gt;
   2. Signal handling - we might be able to leave this out as it seems some &amp;quot;light weight&amp;quot; threads use no signals&lt;br /&gt;
   3. Synchronisation&lt;br /&gt;
   4. Memory Handling&lt;br /&gt;
   5. Scheduling Priorities (context switching and how it affects the CPU threading process)[[Paul]]&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Things we might also want to cover in the essay (non-essentials here): --[[User:Rannath|Rannath]] 04:43, 10 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Brief History of threading&lt;br /&gt;
   2. examples of attempts at getting absurd numbers of threads (failures)&lt;br /&gt;
   3. other types of threading, including heavy weight and processes&lt;br /&gt;
   4. Examples of systems that require many threads such as mainframe servers or banking client processing.--[[User:Praubic|Praubic]] 17:34, 11 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Here is an example of a design (the topic asks for key design choices; here is one):&lt;br /&gt;
&lt;br /&gt;
Capriccio is a specific design for scalable user-level threads. It is distinct from most designs in being independent of event-based mechanisms as well as kernel thread models. It is a very good choice for Internet servers, and this implementation can easily support 100,000 threads. It is characterized by high scalability, efficient stack management, and scheduling based on resource usage; however, the performance is not comparable to event-based systems.--[[User:Praubic|Praubic]] 13:32, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(B)Kernel &lt;br /&gt;
   1. Program Thread manipulation through system calls --[[User:Hirving|Hirving]] 20:05, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(C)Hardware --[[User:Hirving|Hirving]] 19:55, 7 October 2010 (UTC)&lt;br /&gt;
   1. Simultaneous Multithreading&lt;br /&gt;
   2. Multi-core processors&lt;br /&gt;
&lt;br /&gt;
== Essay Outline ==&lt;br /&gt;
&lt;br /&gt;
#The thesis is an answer to the question, so... that&#039;s the first step - or the last step; we can always present our info and make our thesis match the info.&lt;br /&gt;
#List all questions and points we have about the topic&lt;br /&gt;
&lt;br /&gt;
Questions:&lt;br /&gt;
#What makes threads non-scalable? List the problems&lt;br /&gt;
#What utility do some scalable implementations lack? Why?&lt;br /&gt;
#Just how scalable does a full utility implementation get?&lt;br /&gt;
&lt;br /&gt;
Answers:&lt;br /&gt;
# &lt;br /&gt;
# Signals and portability (maybe) both add overhead, which would slow down threads&lt;br /&gt;
#&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Intro (fill in info)&lt;br /&gt;
# Thesis&lt;br /&gt;
# main topics &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Body (made of many main points)&lt;br /&gt;
&lt;br /&gt;
Main Point 1 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- efficient thread creation/destruction is more scalable&amp;lt;br&amp;gt;&lt;br /&gt;
-- NPTL&#039;s improvements over LinuxThreads - primarily due to lower overhead of creation/destruction &#039;&#039;1&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 2 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- UMS &amp;amp; user-space threads are more scalable - maybe&amp;lt;br&amp;gt;&lt;br /&gt;
-- context switches are costly &#039;&#039;From class&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- blocking locks have lower latency when twinned with a user space scheduler &#039;&#039;8&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Ok for point 2 -&amp;gt; I posted a draft on the essay page, but I&#039;m not certain whether I should talk about fibers, since they also function in user space but they&#039;re not UMS. --[[User:Praubic|Praubic]] 00:18, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Main Point 3&amp;lt;br&amp;gt;&lt;br /&gt;
- Certain bottlenecks appear in scaled implementations; removing these improves scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;False cache-line sharing&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- xtime lock to a lockless lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 3.5&amp;lt;br&amp;gt;&lt;br /&gt;
Fine-grained over coarse-grained&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;Big Kernel Lock&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- dcache_lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Link the Main points to the thesis&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Conclusion&lt;br /&gt;
# restate info&lt;br /&gt;
# affirmation of thesis&lt;br /&gt;
&lt;br /&gt;
Here is the first paragraph that I attempted. Please feel free to change or even delete it from here. &lt;br /&gt;
&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and are much easier to start; therefore millions of them may exist in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads, and there are a few designs that incorporate them, mainly fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources; however, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&lt;br /&gt;
--[[User:Praubic|Praubic]] 19:04, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
we can add this to the intro paragraph:&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process?&lt;br /&gt;
&lt;br /&gt;
It is possible for systems to support millions of threads or more within a single process because the system switches execution resources between threads, creating concurrent execution. Concurrency is when multiple threads wait on queues for switching and cannot actually run at the same time, but the speed at which they are switched makes it look as though they are. [[vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
I suggest that we start filling out the main points of the essay. We can discuss the intricacies as we go along. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
# Short history of threads in Linux and new implementation of them. [http://www.drdobbs.com/open-source/184406204;jsessionid=3MRSO5YMO1QVRQE1GHRSKHWATMY32JVN NPTL: The New Implementation of Threads for Linux ] [[User:Gautam|Gautam]] 22:18, 5 October 2010 (UTC)&lt;br /&gt;
# This paper discusses the design choices [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.6590&amp;amp;rep=rep1&amp;amp;type=pdf Native POSIX Threads] [[User:Gautam|Gautam]] 22:11, 5 October 2010 (UTC)&lt;br /&gt;
# lightweight threads vs kernel threads [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf PicoThreads: Lightweight Threads in Java] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# [http://eigenclass.org/hiki/lightweight-threads-with-lwt Eigenclass Comparing lightweight threads] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# A lightweight thread implementation for Unix [http://www.usenix.org/publications/library/proceedings/sa92/stein.pdf Implementing Lightweight Threads] --[[User:Rannath|Rannath]] 00:49, 6 October 2010 (UTC) [[User:Gbint|Gbint]] 19:50, 5 October 2010 (UTC)&lt;br /&gt;
#Not in this group, but I thought that this paper was excellent: [http://www.sandia.gov/~rcmurph/doc/qt_paper.pdf Qthreads: An API for Programming with Millions of Lightweight Threads]&lt;br /&gt;
# Difference between single and multi threading [http://wiki.answers.com/Q/Single_threaded_Process_and_Multi-threaded_Process] [[vG]]&lt;br /&gt;
# [http://hdl.handle.net/1853/6804 Implementation of Scalable Blocking Locks using an Adaptative Thread Scheduler] --[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&lt;br /&gt;
# Research Group working on Simultaneous Multithreading [http://www.cs.washington.edu/research/smt/ Simultaneous Multithreading] --[[User:Hirving|Hirving]] 19:58, 7 October 2010 (UTC)&lt;br /&gt;
# This site provides in-depth info about threads, threads-pooling, scheduling: http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx [[Paul]]&lt;br /&gt;
# Here is another site that outlines THREAD designs and techniques: http://people.csail.mit.edu/rinard/osnotes/h2.html [[Paul]]&lt;br /&gt;
# [http://www.cosc.brocku.ca/Offerings/4P13/slides/threads.ppt Interesting presentation: really worth checking out]  [[Paul]]&lt;br /&gt;
# KERNEL vs USERMODE http://www.wordiq.com/definition/Thread_(computer_science)--[[User:Praubic|Praubic]] 18:06, 10 October 2010 (UTC)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83 Scalability in linux]&lt;br /&gt;
# [http://hillside.net/plop/2007/papers/PLoP2007_Ahluwalia.pdf This has something to do with our question...]&lt;br /&gt;
# [http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx Scheduling Priorities (Windows)], Microsoft (23 September 2010) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.novell.com/coolsolutions/feature/14878.html Linux Scheduling Priorities Explained], Novell (11 October 2005) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/ Inside the Linux 2.6 Completely Fair Scheduler], IBM (15 December 2009) --[[User:Spanke|Shane]]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3600</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3600"/>
		<updated>2010-10-14T04:24:10Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and are much easier to start; therefore millions of them may exist in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads. There are a few designs that incorporate them, mainly fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Scalable Threads: The Problems ==&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cache-Coherency:&#039;&#039;&#039; There can be some instances of misplaced information in the cache that cause a &amp;quot;cache-coherency operation&amp;quot; to be called. This operation is comparatively expensive. Once the misplaced information that causes this problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Locks:&#039;&#039;&#039; There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Holding the lock for reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless read.&lt;br /&gt;
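&lt;br /&gt;
A lockless read of this kind can be sketched as a sequence lock (an illustration only, not the actual kernel code; the function names and the use of C11 atomics are ours): the writer bumps a counter before and after updating, and a reader retries whenever it may have observed a torn value, so readers never block the writer.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Sequence-lock sketch: an odd counter value means an update is in&lt;br /&gt;
    progress, so the reader retries instead of taking a lock. */&lt;br /&gt;
 static _Atomic unsigned seq;&lt;br /&gt;
 static _Atomic unsigned long time_val;&lt;br /&gt;
 &lt;br /&gt;
 void write_time(unsigned long t) { /* single writer assumed */&lt;br /&gt;
     atomic_fetch_add(&amp;amp;seq, 1); /* odd: update in progress */&lt;br /&gt;
     atomic_store(&amp;amp;time_val, t);&lt;br /&gt;
     atomic_fetch_add(&amp;amp;seq, 1); /* even: update complete */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 unsigned long read_time(void) {&lt;br /&gt;
     unsigned s;&lt;br /&gt;
     unsigned long t;&lt;br /&gt;
     do {&lt;br /&gt;
         s = atomic_load(&amp;amp;seq);&lt;br /&gt;
         t = atomic_load(&amp;amp;time_val);&lt;br /&gt;
     } while (s % 2 != 0 || s != atomic_load(&amp;amp;seq)); /* retry on race */&lt;br /&gt;
     return t;&lt;br /&gt;
 }&lt;br /&gt;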
&lt;br /&gt;
&#039;&#039;&#039;Scheduler:&#039;&#039;&#039; The multiqueue scheduler was the third major bottleneck; altogether it ate up 25% of the CPU time. It had two problems: its spinlock ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler with a more efficient one [the O(1) scheduler].&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CPU:&#039;&#039;&#039; The next few bottlenecks are related: they are both examples of coarse-granularity locks eating CPU time. Granularity refers to the execution time of a code segment; the closer a segment is to the speed of an atomic action, the finer its granularity. One coarse-grained bottleneck was the dcache_lock. It ate up a noticeable amount of time in normal use, but it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Kernel Lock:&#039;&#039;&#039; One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottlenecks.&lt;br /&gt;
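&lt;br /&gt;
To illustrate granularity (a toy sketch, not kernel code): with one coarse lock every update serializes, while per-bucket locks let unrelated updates proceed in parallel.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 #define BUCKETS 64&lt;br /&gt;
 &lt;br /&gt;
 static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* coarse */&lt;br /&gt;
 static pthread_mutex_t bucket_lock[BUCKETS];                 /* fine */&lt;br /&gt;
 static int table[BUCKETS];&lt;br /&gt;
 &lt;br /&gt;
 void init_locks(void) {&lt;br /&gt;
     for (int i = 0; i &amp;lt; BUCKETS; i++)&lt;br /&gt;
         pthread_mutex_init(&amp;amp;bucket_lock[i], NULL);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void add_coarse(int key) {&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;big_lock); /* every thread contends here */&lt;br /&gt;
     table[key % BUCKETS]++;&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;big_lock);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void add_fine(int key) {&lt;br /&gt;
     pthread_mutex_t *l = &amp;amp;bucket_lock[key % BUCKETS];&lt;br /&gt;
     pthread_mutex_lock(l); /* contention only within one bucket */&lt;br /&gt;
     table[key % BUCKETS]++;&lt;br /&gt;
     pthread_mutex_unlock(l);&lt;br /&gt;
 }&lt;br /&gt;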
&lt;br /&gt;
&#039;&#039;&#039;MAIN POINT 2 Paragraph draft&#039;&#039;&#039; --[[User:Praubic|Praubic]] 00:21, 14 October 2010 (UTC) still in progress and debating &lt;br /&gt;
&lt;br /&gt;
The introduction of Windows NT and OS/2 brought about a design that provides cheap threading while having expensive processes. UMS, which reflects such a design, is a recommended mechanism for high-performance applications that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate-performance systems, because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviors, such as race conditions, and requires careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, UMS scheduling can be assigned to each logical processor, thereby creating affinity for related threads to function around one scheduler. This could turn out to be inefficient, depending on whether there are many related threads that could end up starving other processes.&lt;br /&gt;
&lt;br /&gt;
Ok for point 2 -&amp;gt; I posted a draft on the essay page, but I&#039;m not certain whether I should talk about fibers, since they also function in user space but they&#039;re not UMS. --Praubic&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039; --[[User:Gautam|Gautam]] 00:29, 14 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to the high cost of implementation. This gave birth to the 1:1 model.&lt;br /&gt;
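&lt;br /&gt;
A quick way to observe the 1:1 model on Linux (a small sketch; SYS_gettid returns the raw kernel thread id): every pthread reports a distinct kernel tid, because each user thread is backed by its own kernel task.&lt;br /&gt;
&lt;br /&gt;
 #define _GNU_SOURCE&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;sys/syscall.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Under a 1:1 library such as NPTL, each pthread is its own kernel&lt;br /&gt;
    task, so every thread prints a different kernel tid. */&lt;br /&gt;
 static void *show(void *arg) {&lt;br /&gt;
     (void)arg;&lt;br /&gt;
     printf(&amp;quot;pid %d, kernel tid %ld\n&amp;quot;,&lt;br /&gt;
            getpid(), (long)syscall(SYS_gettid));&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     pthread_t t[3];&lt;br /&gt;
     for (int i = 0; i &amp;lt; 3; i++)&lt;br /&gt;
         pthread_create(&amp;amp;t[i], NULL, show, NULL);&lt;br /&gt;
     for (int i = 0; i &amp;lt; 3; i++)&lt;br /&gt;
         pthread_join(t[i], NULL);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;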
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.&lt;br /&gt;
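&lt;br /&gt;
A common POSIX idiom that builds on this (a sketch assuming the 1:1 model): block the signal in every thread and dedicate one thread to collecting it, so delivery is never ambiguous.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;signal.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* SIGUSR1 is blocked in every thread, so it stays pending until the&lt;br /&gt;
    dedicated thread collects it synchronously with sigwait(). */&lt;br /&gt;
 static void *sig_thread(void *arg) {&lt;br /&gt;
     sigset_t *set = arg;&lt;br /&gt;
     int sig;&lt;br /&gt;
     sigwait(set, &amp;amp;sig);&lt;br /&gt;
     printf(&amp;quot;got signal %d\n&amp;quot;, sig);&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
     sigset_t set;&lt;br /&gt;
     pthread_t t;&lt;br /&gt;
     sigemptyset(&amp;amp;set);&lt;br /&gt;
     sigaddset(&amp;amp;set, SIGUSR1);&lt;br /&gt;
     pthread_sigmask(SIG_BLOCK, &amp;amp;set, NULL); /* inherited by new threads */&lt;br /&gt;
     pthread_create(&amp;amp;t, NULL, sig_thread, &amp;amp;set);&lt;br /&gt;
     kill(getpid(), SIGUSR1); /* collected by the sigwait thread */&lt;br /&gt;
     pthread_join(t, NULL);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;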
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
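&lt;br /&gt;
The new kernel functionality alluded to here is the futex (fast userspace mutex); below is a deliberately simplified sketch of a futex-backed lock (the sys_futex wrapper and the two-state protocol are ours for illustration - production locks track contention so an uncontended unlock can skip the wake system call):&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;linux/futex.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;sys/syscall.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Two-state futex lock: threads stay in user space on the fast path&lt;br /&gt;
    and only enter the kernel to sleep or to wake a sleeper. */&lt;br /&gt;
 static _Atomic int lock_word; /* 0 = free, 1 = taken */&lt;br /&gt;
 &lt;br /&gt;
 static long sys_futex(_Atomic int *addr, int op, int val) {&lt;br /&gt;
     return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void lock(void) {&lt;br /&gt;
     int expected = 0;&lt;br /&gt;
     while (!atomic_compare_exchange_strong(&amp;amp;lock_word, &amp;amp;expected, 1)) {&lt;br /&gt;
         sys_futex(&amp;amp;lock_word, FUTEX_WAIT, 1); /* sleep while taken */&lt;br /&gt;
         expected = 0;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void unlock(void) {&lt;br /&gt;
     atomic_store(&amp;amp;lock_word, 0);&lt;br /&gt;
     sys_futex(&amp;amp;lock_word, FUTEX_WAKE, 1); /* wake one waiter */&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;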
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Thread memory management is an important design choice when attempting to create a large number of threads in a single process, from creation to maintenance and deallocation. A thread&#039;s data structure is made up of a program counter, a stack, and a control block. The control block of a thread is needed for thread management, as it contains the state data of the thread. Optimization of this data structure can greatly increase performance with large numbers of threads. &lt;br /&gt;
The creation of a thread can take place before the process actually requires it to run, waiting until an idle processor becomes available to run the thread. Thread overhead (the memory, CPU time, and read/write time required to initialize the thread) is a problem that can arise with this creation process, since it frontloads the cost. Another problem is that the thread must allocate the memory required for its stack at creation, because it is expensive to dynamically allocate the stack memory. A way to optimize this creation process for large numbers of threads is to copy the arguments of the thread into its control block; this allows the thread&#039;s stack to be allocated at the thread&#039;s startup (when the thread starts being used) and not when the thread is created. When the thread enters startup, it can copy its arguments out of its control block and allocate its memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate at which the system can create, start, and finish threads that are in contention), and, if thread memory management is done in a serial processing manner, these two factors combine to create a maximum rate of thread creation.&lt;br /&gt;
The deallocation of a thread can also be optimized to increase the scalability of threads. Storing deallocated stacks and control blocks in a free list allows allocation and deallocation to be list operations; if they were not stored in a free list, the thread overhead would include finding a correctly sized block of free memory to store the stack. [http://portal.acm.org/citation.cfm?id=75378] [[hirving]]&lt;br /&gt;
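&lt;br /&gt;
A sketch of the free-list idea just described (the function names are illustrative; single-threaded for brevity - a real implementation would guard the list with a lock or keep per-CPU lists):&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Retired stacks are pushed onto a free list, so allocation and&lt;br /&gt;
    deallocation are O(1) list operations, not memory searches. */&lt;br /&gt;
 #define STACK_SIZE (64 * 1024)&lt;br /&gt;
 &lt;br /&gt;
 struct stack_node {&lt;br /&gt;
     struct stack_node *next; /* the stack memory itself follows */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static struct stack_node *free_stacks;&lt;br /&gt;
 &lt;br /&gt;
 void *stack_alloc(void) {&lt;br /&gt;
     struct stack_node *n = free_stacks;&lt;br /&gt;
     if (n) { /* fast path: reuse a retired stack */&lt;br /&gt;
         free_stacks = n-&amp;gt;next;&lt;br /&gt;
         return n;&lt;br /&gt;
     }&lt;br /&gt;
     return malloc(STACK_SIZE); /* cold path: fresh allocation */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void stack_free(void *s) {&lt;br /&gt;
     struct stack_node *n = s;&lt;br /&gt;
     n-&amp;gt;next = free_stacks; /* push onto the free list */&lt;br /&gt;
     free_stacks = n;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;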
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority, which on Windows is a number ranging from 0 to 31; Linux instead orders threads with the red-black tree used by the CFS (Completely Fair Scheduler). All threads are executed in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. Threads are composed of a thread context, which internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs as the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which execute entirely in userspace, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
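&lt;br /&gt;
As a small illustration of the two schemes (POSIX API, Linux semantics assumed): the fixed numeric priorities apply to the real-time policies, while ordinary threads are ranked by the CFS; a thread&#039;s policy and priority can be queried and changed like this.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;sched.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Query this thread&#039;s policy and priority, then try to move it to&lt;br /&gt;
    SCHED_FIFO (usually requires elevated privileges). */&lt;br /&gt;
 int main(void) {&lt;br /&gt;
     struct sched_param sp;&lt;br /&gt;
     int policy, err;&lt;br /&gt;
     pthread_t self = pthread_self();&lt;br /&gt;
 &lt;br /&gt;
     pthread_getschedparam(self, &amp;amp;policy, &amp;amp;sp);&lt;br /&gt;
     printf(&amp;quot;policy %d, priority %d\n&amp;quot;, policy, sp.sched_priority);&lt;br /&gt;
 &lt;br /&gt;
     sp.sched_priority = sched_get_priority_min(SCHED_FIFO);&lt;br /&gt;
     err = pthread_setschedparam(self, SCHED_FIFO, &amp;amp;sp);&lt;br /&gt;
     if (err != 0)&lt;br /&gt;
         printf(&amp;quot;setschedparam failed: %d\n&amp;quot;, err);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;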
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3327</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3327"/>
		<updated>2010-10-13T19:18:29Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and are much easier to start; therefore millions of them may exist in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads. There are a few designs that incorporate them, mainly fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):&lt;br /&gt;
&lt;br /&gt;
There can be some instances of misplaced information in the cache that cause a &amp;quot;cache-coherency operation&amp;quot; to be called. This operation is comparatively expensive. Once the misplaced information that causes this problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Holding the lock for reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless read.&lt;br /&gt;
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck; altogether it ate up 25% of the CPU time. It had two problems: its spinlock ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler with a more efficient one [the O(1) scheduler].&lt;br /&gt;
&lt;br /&gt;
The next few bottlenecks are related. They are both examples of coarse-granularity locks eating CPU time. Granularity refers to the execution time of a code segment; the closer a segment is to the speed of an atomic action, the finer its granularity.&lt;br /&gt;
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It ate up a noticeable amount of time in normal use, but it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to the high cost of implementation. This gave birth to the 1:1 model.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is to have low startup costs for threads so that scalability is possible. The biggest problem, time-wise, outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is corrected by optimizing the memory allocation for the threads.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority, which on Windows is a number ranging from 0 to 31; Linux instead orders threads with the red-black tree used by the CFS (Completely Fair Scheduler). All threads are executed in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. Threads are composed of a thread context, which internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs as the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which execute entirely in userspace, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC) --[[Spanke|Shane]] &#039;&#039;&#039;Revised it a bit; scheduling priority does not use a number range in Linux.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3326</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3326"/>
		<updated>2010-10-13T19:17:22Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and are much easier to start; therefore millions of them may exist in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads. There are a few designs that incorporate them, mainly fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):&lt;br /&gt;
&lt;br /&gt;
There can be some instances of misplaced information in the cache that cause a &amp;quot;cache-coherency operation&amp;quot; to be called. This operation is comparatively expensive. Once the misplaced information that causes this problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Holding the lock for reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless read.&lt;br /&gt;
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck; altogether it ate up 25% of the CPU time. It had two problems: its spinlock ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler with a more efficient one [the O(1) scheduler].&lt;br /&gt;
&lt;br /&gt;
The next few bottlenecks are related. They are both examples of coarse-granularity locks eating CPU time. Granularity refers to the execution time of a code segment; the closer a segment is to the speed of an atomic action, the finer its granularity.&lt;br /&gt;
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It ate up a noticeable amount of time in normal use, but it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to the high cost of implementation. This gave birth to the 1:1 model.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is to have low startup costs for threads so that scalability is possible. The biggest problem, time-wise, outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is corrected by optimizing the memory allocation for the threads.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority, which on Windows is a number ranging from 0 to 31; Linux instead orders threads with the red-black tree used by the CFS (Completely Fair Scheduler). All threads are executed in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. Threads are composed of a thread context, which internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs as the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which execute entirely in userspace, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC) --[[Spanke|Shane]]: &#039;&#039;&#039;Revised it a bit; scheduling priority does not use a number range in Linux.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3325</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3325"/>
		<updated>2010-10-13T19:15:32Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require fewer system resources than concurrent cooperating processes and are much easier to start; therefore millions of them may exist in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered heavier, and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads by a threads library such as libpthreads. There are a few designs that incorporate them, mainly fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):&lt;br /&gt;
&lt;br /&gt;
There can be some instances of misplaced information in the cache that cause a &amp;quot;cache-coherency operation&amp;quot; to be called. This operation is comparatively expensive. Once the misplaced information that causes this problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Holding the lock for reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless read.&lt;br /&gt;
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck; altogether it ate up 25% of the CPU time. It had two problems: its spinlock ate up the majority of that CPU time, while the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler with a more efficient one [the O(1) scheduler].&lt;br /&gt;
&lt;br /&gt;
The next few bottlenecks are related. They are both examples of coarse-granularity locks eating CPU time. Granularity refers to the execution time of a code segment; the closer a segment is to the speed of an atomic action, the finer its granularity.&lt;br /&gt;
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL&#039;s usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It ate up a noticeable amount of time in normal use, but it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of Linux.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel functions. Although the M:N model would require a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to the high cost of implementation. This gave birth to the 1:1 model.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is to have low startup costs for threads so that scalability is possible. The biggest problem, time-wise, outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is corrected by optimizing the memory allocation for the threads.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that can be scheduled according to its scheduling priority, which on Windows is a number ranging from 0 to 31; Linux instead orders threads with the red-black tree used by the CFS (Completely Fair Scheduler). All threads are executed in a time slice assigned to them in round-robin fashion, and lower-priority threads wait until the ones above finish performing their tasks. Threads are composed of a thread context, which internally breaks down into a set of machine registers and the kernel and user stacks, all linked to the address space of the process where the thread resides. A context switch occurs as the time slice elapses and an equal (or higher) priority thread becomes available; efficiently implemented context switching is what allows high scalability. For example, fibers, which execute entirely in userspace, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC) [[Shane]]: &#039;&#039;&#039;Revising it a bit; Linux does not use the same functionality as Windows.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3324</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3324"/>
		<updated>2010-10-13T19:15:02Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process, sharing data with them. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why a single process may contain millions of them. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs that rely on them alone are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating bottlenecks. While porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks:&lt;br /&gt;
&lt;br /&gt;
There can be instances of badly placed data in the cache that cause false cache-line sharing: logically unrelated items sharing a cache line force an expensive &amp;quot;cache-coherency operation&amp;quot; whenever different processors write to them. Once the data causing the problem is identified, it can be moved to limit the problem.&lt;br /&gt;
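As an illustration (our sketch, not code from the paper), the usual fix is to pad such data so each item owns a whole cache line, and one processor&#039;s writes stop invalidating another&#039;s:&lt;br /&gt;
&lt;pre&gt;
/* Assume 64-byte cache lines; each per-CPU counter gets a line to
   itself, so updates by one CPU no longer bounce the others&#039; lines. */
#define CACHE_LINE 64

struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter hits[16];   /* e.g. one slot per CPU */
&lt;/pre&gt;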
&lt;br /&gt;
Some frequently taken locks also contribute to bottlenecks. One such lock is the xtime_lock in Linux: readers holding the lock prevented writes to the timer value, leading to writer starvation. This problem was solved by using a lockless read.&lt;br /&gt;
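The lockless read works roughly like a sequence counter. The sketch below is our reconstruction of the idea, not the kernel&#039;s actual code (a real implementation also needs memory barriers): the writer bumps a counter around each update, and a reader retries if it saw an odd count or the count changed while it was reading.&lt;br /&gt;
&lt;pre&gt;
/* Reconstruction of the lockless-read idea, not kernel code. */
typedef struct {
    volatile unsigned seq;    /* odd while an update is in progress */
    volatile long time_val;   /* the published value */
} seq_time;

void write_time(seq_time *t, long now)   /* one writer: the timer tick */
{
    t-&amp;gt;seq++;                /* odd: readers will retry  */
    t-&amp;gt;time_val = now;
    t-&amp;gt;seq++;                /* even again: value stable */
}

long read_time(seq_time *t)              /* any number of readers */
{
    unsigned s;
    long v;
    do {
        s = t-&amp;gt;seq;
        v = t-&amp;gt;time_val;
    } while ((s &amp;amp; 1) || s != t-&amp;gt;seq);  /* no lock, just retry */
    return v;
}
&lt;/pre&gt;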
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck, altogether eating up 25% of CPU time. It had two problems: its spinlock accounted for the majority of that time, while the rest went into computing and recomputing cache-related scheduling information. These problems were fixed by replacing it with a more efficient scheduler, the O(1) scheduler.&lt;br /&gt;
&lt;br /&gt;
The next two bottlenecks are related: both are examples of coarse-grained locks eating CPU time. Granularity refers to the execution time of a locked code segment; the closer a segment is to the cost of an atomic action, the finer its granularity.&lt;br /&gt;
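As a toy illustration of granularity (ours, not from the paper): a coarse-grained design guards unrelated fields with one lock, while a fine-grained design splits the lock so unrelated updates stop contending.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;pthread.h&amp;gt;

/* Coarse-grained: a single lock serializes updates to both fields. */
struct stats_coarse {
    pthread_mutex_t lock;
    long reads, writes;
};

/* Fine-grained: unrelated fields get independent locks. */
struct stats_fine {
    pthread_mutex_t rlock, wlock;
    long reads, writes;
};

void count_read(struct stats_fine *s)
{
    pthread_mutex_lock(&amp;amp;s-&amp;gt;rlock);   /* writers of s-&amp;gt;writes unaffected */
    s-&amp;gt;reads++;
    pthread_mutex_unlock(&amp;amp;s-&amp;gt;rlock);
}
&lt;/pre&gt;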
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s original kernel-wide synchronization lock. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred approach on Linux NUMA systems was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of this bottleneck.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It consumed a noticeable amount of time in normal use, but it was also taken by the much more frequently called dnotify_parent() function, which made its cost unacceptable. The dcache_lock strategy was therefore replaced with a finer-grained strategy from a later Linux implementation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel&#039;s thread functions. The M:N model would require a far more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not a good fit for the Linux kernel because of its high implementation cost, and this gave birth to the 1:1 model.&lt;br /&gt;
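Under the 1:1 model, every user thread is a kernel-visible task. The sketch below (ours, Linux-specific) makes the mapping visible: each pthread reports its own kernel thread ID alongside the shared process ID.&lt;br /&gt;
&lt;pre&gt;
/* Compile with: gcc tids.c -pthread */
#define _GNU_SOURCE
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;sys/syscall.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

static void *show(void *arg)
{
    /* Same PID, distinct kernel TID per thread: a 1:1 mapping. */
    printf(&amp;quot;pid %d, kernel tid %ld\n&amp;quot;, getpid(), syscall(SYS_gettid));
    return arg;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&amp;amp;a, NULL, show, NULL);
    pthread_create(&amp;amp;b, NULL, show, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
&lt;/pre&gt;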
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and honours each thread&#039;s signal mask. Since a signal is delivered only to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
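A common pattern that exploits per-thread signal masks (a sketch, ours): block a signal everywhere and let one dedicated thread consume it synchronously with sigwait(), so the kernel always has exactly one willing recipient.&lt;br /&gt;
&lt;pre&gt;
/* Compile with: gcc sig.c -pthread */
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;signal.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static sigset_t set;

static void *sig_thread(void *arg)
{
    int sig;
    sigwait(&amp;amp;set, &amp;amp;sig);             /* synchronous, race-free delivery */
    printf(&amp;quot;got signal %d\n&amp;quot;, sig);
    return arg;
}

int main(void)
{
    pthread_t t;
    sigemptyset(&amp;amp;set);
    sigaddset(&amp;amp;set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &amp;amp;set, NULL);  /* children inherit the mask */
    pthread_create(&amp;amp;t, NULL, sig_thread, NULL);
    pthread_kill(t, SIGUSR1);          /* goes to the one thread waiting */
    pthread_join(t, NULL);
    return 0;
}
&lt;/pre&gt;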
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality (the futex system call) has since been added to the kernel, on which all of these kinds of synchronization can be built.&lt;br /&gt;
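Below is our own minimal sketch of the futex idea (not NPTL&#039;s actual code): a thread sleeps in the kernel only if a user-space word still holds the value it expects, so the uncontended path never enters the kernel.&lt;br /&gt;
&lt;pre&gt;
/* Linux-specific sketch. Compile with: gcc futexdemo.c -pthread */
#define _GNU_SOURCE
#include &amp;lt;linux/futex.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;sys/syscall.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;

static long futex(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static uint32_t flag;                  /* 0 = not ready, 1 = ready */

void wait_for_flag(void)               /* waiter side */
{
    /* Sleeps in the kernel only while flag is still 0:
       no busy waiting, no signals involved. */
    while (__atomic_load_n(&amp;amp;flag, __ATOMIC_ACQUIRE) == 0)
        futex(&amp;amp;flag, FUTEX_WAIT, 0);
}

void set_flag(void)                    /* waker side */
{
    __atomic_store_n(&amp;amp;flag, 1, __ATOMIC_RELEASE);
    futex(&amp;amp;flag, FUTEX_WAKE, 1);      /* wake one waiter */
}
&lt;/pre&gt;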
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is a low startup cost for threads, so that the implementation scales. The biggest time cost outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is addressed by optimizing how that memory is allocated.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
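One concrete lever for (D), sketched below (our example; counts and sizes are illustrative): the per-thread stack reservation dominates memory cost, and shrinking it through thread attributes is what lets huge thread counts fit in one address space.&lt;br /&gt;
&lt;pre&gt;
/* Compile with: gcc stacks.c -pthread */
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static void *task(void *arg) { return arg; }

int main(void)
{
    enum { N = 10000 };                /* illustrative thread count */
    static pthread_t tid[N];
    pthread_attr_t attr;

    pthread_attr_init(&amp;amp;attr);
    /* 64 KiB per stack instead of the common 8 MiB default. */
    pthread_attr_setstacksize(&amp;amp;attr, 64 * 1024);

    for (int i = 0; i &amp;lt; N; i++)
        pthread_create(&amp;amp;tid[i], &amp;amp;attr, task, NULL);
    for (int i = 0; i &amp;lt; N; i++)
        pthread_join(tid[i], NULL);
    puts(&amp;quot;all threads joined&amp;quot;);
    return 0;
}
&lt;/pre&gt;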
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: in Windows this is a number ranging from 0 to 31, while Linux&#039;s CFS (Completely Fair Scheduler) instead orders runnable threads in a red-black tree. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread consists of its context, which breaks down into the set of machine registers and the kernel and user stacks, all linked to the address space of the process in which the thread resides. A context switch occurs when the time slice elapses and a thread of equal (or higher) priority becomes available; implemented efficiently, context switching is what permits high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC) [Shane]: Revised it a bit; Linux does not use the same priority scheme as Windows.&lt;br /&gt;
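For illustration (our sketch, not from the essay&#039;s sources): POSIX exposes numeric priorities through scheduling policies, e.g. a thread started under SCHED_FIFO with an explicit priority, roughly analogous to Windows&#039; 0-31 levels.&lt;br /&gt;
&lt;pre&gt;
/* Needs privilege (root or CAP_SYS_NICE) on Linux.
   Compile with: gcc prio.c -pthread */
#include &amp;lt;pthread.h&amp;gt;
#include &amp;lt;sched.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

static void *task(void *arg) { return arg; }

int main(void)
{
    pthread_t t;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };

    pthread_attr_init(&amp;amp;attr);
    pthread_attr_setinheritsched(&amp;amp;attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&amp;amp;attr, SCHED_FIFO);  /* fixed priority */
    pthread_attr_setschedparam(&amp;amp;attr, &amp;amp;sp);

    int err = pthread_create(&amp;amp;t, &amp;amp;attr, task, NULL);
    if (err != 0)
        fprintf(stderr, &amp;quot;pthread_create: %s\n&amp;quot;, strerror(err));
    else
        pthread_join(t, NULL);
    return 0;
}
&lt;/pre&gt;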
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3322</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_7&amp;diff=3322"/>
		<updated>2010-10-13T19:12:35Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Sources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== The Question ==&lt;br /&gt;
&#039;&#039;&#039;Original:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rannath:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The question seems to be about the number and scalability of threads, not the gross mechanics.&lt;br /&gt;
&lt;br /&gt;
To be clearer: we can limit ourselves from thread implementations in general to thread scalability... ignore the machinery that&#039;s required for all threads, unless it&#039;s required specifically for many threads. (I didn&#039;t find any implementations that required special hardware.)&lt;br /&gt;
&lt;br /&gt;
I would also argue that since OSes have to run on many kinds of hardware, one cannot guarantee that unique/rare hardware features will be there. While we can talk about hardware, we should limit it to a mention at most. OR we could mention prospective hardware that could help out but is not yet standard. It depends on whether we want to describe things &amp;quot;as they are&amp;quot; or &amp;quot;as they might be&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Utility of such massively scalable thread implementations&amp;quot;: I took this as asking what functionality (of individual threads) one has to give up to make threads scalable.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Gautam:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
I think the hardware is as relevant as the software. Not everything can be done in software, and hardware support is an important factor in the solutions to many problems that OSes face. My take.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Henry:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
Since the question is about the system as a whole, I think the answer should include both software and hardware support for large numbers of threads. The question revolves around how a system can handle millions of threads and what major factors allow it to do so. Also, the last part of the question seems to ask what this number of threads allows a process to do.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Shane:&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
In response to the above&#039;s idea on the last part of the question, I would argue that it enables fast execution because whenever a thread stalls on a cache miss, other threads can pick up the work, so long as there are enough resources. Using more threads on the same task also helps keep the cache warm (through sharing) so that it misses less often. Of course, this only applies if the threads are assigned to the same task; you cannot sync threads running different applications, it just wouldn&#039;t make sense. The only issue with this idea is that the software must support this number of threads.&lt;br /&gt;
&lt;br /&gt;
== Group 7 ==&lt;br /&gt;
&lt;br /&gt;
Let us start out by listing our names and (preferred) email addresses.&lt;br /&gt;
&lt;br /&gt;
Gautam Akiwate         &amp;lt;gautam.akiwate@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patrick Young(rannath) &amp;lt;rannath@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
vG Vivek &amp;lt;support.tamiltreasure@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Shane Panke &amp;lt;shanepanke@msn.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Henry Irving &amp;lt;sens.henry@gmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Paul Raubic &amp;lt;paul_raubic@hotmail.com&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Guidelines ==&lt;br /&gt;
&lt;br /&gt;
Raw info should have some indication of where you got it for citation.&lt;br /&gt;
&lt;br /&gt;
Claim your info so we don&#039;t need to dig for who got what when we need clarification.&lt;br /&gt;
&lt;br /&gt;
Feel free to add to or edit someone else&#039;s info; just keep their signature so we can discuss changes.&lt;br /&gt;
&lt;br /&gt;
Sign changes (once), preferably without time stamps. Ex: --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
== Log ==&lt;br /&gt;
Please maintain a log of your activities in the Log Section. So that we can keep track of the evolution of the essay. --[[User:Gautam|Gautam]]&lt;br /&gt;
&lt;br /&gt;
Moved around some info for clarity&lt;br /&gt;
&lt;br /&gt;
Everyone should post their interpretation of the question in the simplest possible English so we&#039;re on the same page (as someone, maybe me, seems to have the wrong idea about what we&#039;re trying to talk about).&lt;br /&gt;
&lt;br /&gt;
More moving for clarity&lt;br /&gt;
added an essay outline at bottom (feel free to change)&lt;br /&gt;
filled in the outline somewhat&lt;br /&gt;
added questions to the outline for everyone to think on.--[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
First Draft for essay. Please modify and add on. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;Add your future activities here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Facts We have ==&lt;br /&gt;
Start by placing the info here so we can sort through it. I&#039;m going to go into full research/essay writing mode on Sunday if there isn&#039;t enough here.&lt;br /&gt;
&lt;br /&gt;
So far we have:&lt;br /&gt;
Three design choices I&#039;ve seen:&lt;br /&gt;
# Smallest possible footprint per thread (being extremely lightweight) - from everywhere&lt;br /&gt;
# Least number (none, if at all possible) of context switches per thread - &#039;&#039;5&#039;&#039;&lt;br /&gt;
# use of a &amp;quot;thread pool&amp;quot; - &#039;&#039;3&#039;&#039;&lt;br /&gt;
The idea is to reduce the processor time and storage needed per thread so you can have more in the same amount of space. --[[User:Rannath|Rannath]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
These are all related ideas.&lt;br /&gt;
&lt;br /&gt;
Ok, since we are discussing design choices, maybe we could also elaborate on the two major types of threads. I already wrote a few lines here; the source can be found in the citation section:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Fibers (user-mode threads) provide very quick and efficient switching because there is no need for a system call and the kernel is oblivious to the switch - this allows for millions of user-mode threads. ISSUES: a blocking system call in one fiber blocks all other fibers.&lt;br /&gt;
On the other hand, managing threads through the kernel requires a context switch (between user and kernel mode) on creation and removal of each thread, so programs with a prodigious number of threads would suffer huge performance hits.--[[User:Praubic|Praubic]] 18:05, 10 October 2010 (UTC)&#039;&#039;&lt;br /&gt;
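As a concrete sketch of the fiber idea (ours, using the Win32 fiber API covered by the MSDN source): a switch is an ordinary user-mode jump, with no system call and no kernel awareness.&lt;br /&gt;
&lt;pre&gt;
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

static LPVOID main_fiber;   /* fiber identity of the original thread */

/* A fiber routine never returns; it yields by switching away. */
static VOID CALLBACK worker(LPVOID param)
{
    printf(&amp;quot;hello from fiber %s\n&amp;quot;, (const char *)param);
    SwitchToFiber(main_fiber);        /* cooperative yield, no syscall */
}

int main(void)
{
    main_fiber = ConvertThreadToFiber(NULL);        /* thread becomes a fiber */
    LPVOID f = CreateFiber(0, worker, (LPVOID)&amp;quot;A&amp;quot;); /* 0 = default stack */
    SwitchToFiber(f);                 /* run worker until it yields back */
    DeleteFiber(f);
    return 0;
}
&lt;/pre&gt;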
&lt;br /&gt;
&lt;br /&gt;
User-mode scheduling (UMS) is a light-weight mechanism that applications can use to schedule their own threads. The ability to switch between threads in user mode makes UMS more efficient than thread pools for short-duration work items that require few system calls. [[Paul]]&lt;br /&gt;
&lt;br /&gt;
One implementation of UMS is a combination of N:N and N:M, where the N:N relationship exposes N virtual processors to user space so the user can handle scheduling on their own. &#039;&#039;5&#039;&#039; -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
I would scrap the first two below, at most mention them...&lt;br /&gt;
&lt;br /&gt;
#time-division multiplexing&lt;br /&gt;
#threads vs processes&lt;br /&gt;
#I/O Scheduling -[[vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Splitting this off because I don&#039;t think it&#039;s technically part of the answer&amp;lt;br&amp;gt;&lt;br /&gt;
Multithreading generally occurs by time-division multiplexing: the processor switches between different threads, but the switching happens so fast that the user perceives the threads as running at the same time. [[User:vG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
Things that we &#039;&#039;&#039;need&#039;&#039;&#039; to cover in the essay:--[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
This is a &#039;&#039;&#039;need&#039;&#039;&#039; section; section 4 below is not &#039;&#039;&#039;needed&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Type of threading (1:1 1:N M:N)&lt;br /&gt;
   2. Signal handling - we might be able to leave this out as it seems some &amp;quot;light weight&amp;quot; threads use no signals&lt;br /&gt;
   3. Synchronisation&lt;br /&gt;
   4. Memory Handling&lt;br /&gt;
   5. Scheduling Priorities (context switching and how it affects the CPU threading process)[[Paul]]&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Things we might also want to cover in the essay (non-essentials here): --[[User:Rannath|Rannath]] 04:43, 10 October 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
(A)Design Decisions &lt;br /&gt;
   1. Brief History of threading&lt;br /&gt;
   2. examples of attempts at getting absurd numbers of threads (failures)&lt;br /&gt;
   3. other types of threading, including heavy weight and processes&lt;br /&gt;
   4. Examples of systems that require many threads such as mainframe servers or banking client processing.--[[User:Praubic|Praubic]] 17:34, 11 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Here is an example of a design: (the topic asks for key design choices here is one)&lt;br /&gt;
&lt;br /&gt;
Capriccio is a specific design for scalable user-level threads. It is distinct from most designs in being independent of both event-based mechanisms and kernel thread models. It is a very good choice for internet servers, and the implementation could easily support 100,000 threads. It is characterized by high scalability, efficient stack management, and scheduling based on resource usage; however, its performance is not comparable to event-based systems.--[[User:Praubic|Praubic]] 13:32, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(B)Kernel &lt;br /&gt;
   1. Program Thread manipulation through system calls --[[User:Hirving|Hirving]] 20:05, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
(C)Hardware --[[User:Hirving|Hirving]] 19:55, 7 October 2010 (UTC)&lt;br /&gt;
   1. Simultaneous Multithreading&lt;br /&gt;
   2. Multi-core processors&lt;br /&gt;
&lt;br /&gt;
== Essay Outline ==&lt;br /&gt;
&lt;br /&gt;
#The thesis is an answer to the question, so... that&#039;s the first step (or the last step; we can always present our info and make our thesis match it).&lt;br /&gt;
#List all questions and points we have about the topic&lt;br /&gt;
&lt;br /&gt;
Questions:&lt;br /&gt;
#What makes threads non-scalable? List the problems&lt;br /&gt;
#What utility do some scalable implementations lack? Why?&lt;br /&gt;
#Just how scalable does a full utility implementation get?&lt;br /&gt;
&lt;br /&gt;
Answers:&lt;br /&gt;
# &lt;br /&gt;
# Signals and portability (maybe) both add overhead, which would slow down threads&lt;br /&gt;
#&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Intro (fill in info)&lt;br /&gt;
# Thesis&lt;br /&gt;
# main topics &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Body (made of many main points)&lt;br /&gt;
&lt;br /&gt;
Main Point 1 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- efficient thread creation/destruction is more scalable&amp;lt;br&amp;gt;&lt;br /&gt;
-- NPTL&#039;s improvements over LinuxThreads- primarily due to lower overhead of creation/destruction &#039;&#039;1&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 2 -[[Rannath]]&amp;lt;br&amp;gt;&lt;br /&gt;
- UMS &amp;amp; user-space threads are more scalable - maybe&amp;lt;br&amp;gt;&lt;br /&gt;
-- context switches are costly &#039;&#039;From class&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- blocking locks have lower latency when twinned with a user space scheduler &#039;&#039;8&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 3&amp;lt;br&amp;gt;&lt;br /&gt;
- Certain bottlenecks appear in scaled implementations; removing these improves scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;False cache-line sharing&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- xtime lock to a lockless lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Main Point 3.5&amp;lt;br&amp;gt;&lt;br /&gt;
Fine-grained over coarse-grained&amp;lt;br&amp;gt;&lt;br /&gt;
-- &amp;quot;Big Kernel Lock&amp;quot; &#039;&#039;14&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
-- dcache_lock &#039;&#039;14&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Link the Main points to the thesis&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Conclusion&lt;br /&gt;
# restate info&lt;br /&gt;
# affirmation of thesis&lt;br /&gt;
&lt;br /&gt;
Here is the first paragraph that I attempted. Please feel free to change or even delete it from here. &lt;br /&gt;
&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process, sharing data with them. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why a single process may contain millions of them. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs that rely on them alone are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&lt;br /&gt;
--[[User:Praubic|Praubic]] 19:04, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
I suggest that we start filling out the main points of the essay. We can discuss the intricacies as we go along. --[[User:Gautam|Gautam]] 02:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
# Short history of threads in Linux and new implementation of them. [http://www.drdobbs.com/open-source/184406204;jsessionid=3MRSO5YMO1QVRQE1GHRSKHWATMY32JVN NPTL: The New Implementation of Threads for Linux ] [[User:Gautam|Gautam]] 22:18, 5 October 2010 (UTC)&lt;br /&gt;
# This paper discusses the design choices [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.6590&amp;amp;rep=rep1&amp;amp;type=pdf Native POSIX Threads] [[User:Gautam|Gautam]] 22:11, 5 October 2010 (UTC)&lt;br /&gt;
# lightweight threads vs kernel threads [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&amp;amp;rep=rep1&amp;amp;type=pdf PicoThreads: Lightweight Threads in Java] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# [http://eigenclass.org/hiki/lightweight-threads-with-lwt Eigenclass Comparing lightweight threads] --[[User:Rannath|Rannath]] 00:23, 6 October 2010 (UTC)&lt;br /&gt;
# A lightweight thread implementation for Unix [http://www.usenix.org/publications/library/proceedings/sa92/stein.pdf Implementing light weight threads] --[[User:Rannath|Rannath]] 00:49, 6 October 2010 (UTC) [[User:Gbint|Gbint]] 19:50, 5 October 2010 (UTC)&lt;br /&gt;
#Not in this group, but I thought that this paper was excellent: [http://www.sandia.gov/~rcmurph/doc/qt_paper.pdf Qthreads: An API for Programming with Millions of Lightweight Threads]&lt;br /&gt;
# Difference between single and multi-threading [http://wiki.answers.com/Q/Single_threaded_Process_and_Multi-threaded_Process] [[vG]]&lt;br /&gt;
# [http://hdl.handle.net/1853/6804 Implementation of Scalable Blocking Locks using an Adaptative Thread Scheduler] --[[User:Gautam|Gautam]] 19:35, 7 October 2010 (UTC)&lt;br /&gt;
# Research Group working on Simultaneous Multithreading [http://www.cs.washington.edu/research/smt/ Simultaneous Multithreading] --[[User:Hirving|Hirving]] 19:58, 7 October 2010 (UTC)&lt;br /&gt;
# This site provides in-depth info about threads, threads-pooling, scheduling: http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx [[Paul]]&lt;br /&gt;
# Here is another site that outlines THREAD designs and techniques: http://people.csail.mit.edu/rinard/osnotes/h2.html [[Paul]]&lt;br /&gt;
# [http://www.cosc.brocku.ca/Offerings/4P13/slides/threads.ppt Interesting presentation: really worth checking out]  [[Paul]]&lt;br /&gt;
# KERNEL vs USERMODE http://www.wordiq.com/definition/Thread_(computer_science)--[[User:Praubic|Praubic]] 18:06, 10 October 2010 (UTC)&lt;br /&gt;
# [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&amp;amp;rep=rep1&amp;amp;type=pdf#page=83 Scalability in linux]&lt;br /&gt;
# [http://hillside.net/plop/2007/papers/PLoP2007_Ahluwalia.pdf This has something to do with our question...]&lt;br /&gt;
# [http://msdn.microsoft.com/en-us/library/ms685100%28VS.85%29.aspx Scheduling Priorities (Windows)], Microsoft (23 September 2010) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.novell.com/coolsolutions/feature/14878.html Linux Scheduling Priorities Explained], Novell (11 October 2005) --[[User:Spanke|Shane]]&lt;br /&gt;
# [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/ Inside the Linux 2.6 Completely Fair Scheduler], IBM (15 December 2009) --[[User:Spanke|Shane]]&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3321</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3321"/>
		<updated>2010-10-13T19:11:03Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process, sharing data with them. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why a single process may contain millions of them. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs that rely on them alone are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating bottlenecks. While porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks:&lt;br /&gt;
&lt;br /&gt;
There can be instances of badly placed data in the cache that cause false cache-line sharing: logically unrelated items sharing a cache line force an expensive &amp;quot;cache-coherency operation&amp;quot; whenever different processors write to them. Once the data causing the problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
Some frequently taken locks also contribute to bottlenecks. One such lock is the xtime_lock in Linux: readers holding the lock prevented writes to the timer value, leading to writer starvation. This problem was solved by using a lockless read.&lt;br /&gt;
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck, altogether eating up 25% of CPU time. It had two problems: its spinlock accounted for the majority of that time, while the rest went into computing and recomputing cache-related scheduling information. These problems were fixed by replacing it with a more efficient scheduler, the O(1) scheduler.&lt;br /&gt;
&lt;br /&gt;
The next two bottlenecks are related: both are examples of coarse-grained locks eating CPU time. Granularity refers to the execution time of a locked code segment; the closer a segment is to the cost of an atomic action, the finer its granularity.&lt;br /&gt;
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s original kernel-wide synchronization lock. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred approach on Linux NUMA systems was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of this bottleneck.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It consumed a noticeable amount of time in normal use, but it was also taken by the much more frequently called dnotify_parent() function, which made its cost unacceptable. The dcache_lock strategy was therefore replaced with a finer-grained strategy from a later Linux implementation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel&#039;s thread functions. The M:N model would require a far more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not a good fit for the Linux kernel because of its high implementation cost, and this gave birth to the 1:1 model.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and honours each thread&#039;s signal mask. Since a signal is delivered only to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality (the futex system call) has since been added to the kernel, on which all of these kinds of synchronization can be built.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is a low startup cost for threads, so that the implementation scales. The biggest time cost outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is addressed by optimizing how that memory is allocated.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority: in Windows this is a number ranging from 0 to 31 (while Linux&#039;s CFS (Completely Fair Scheduler) instead orders runnable threads in a red-black tree). Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread consists of its context, which breaks down into the set of machine registers and the kernel and user stacks, all linked to the address space of the process in which the thread resides. A context switch occurs when the time slice elapses and a thread of equal (or higher) priority becomes available; implemented efficiently, context switching is what permits high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3319</id>
		<title>COMP 3000 Essay 1 2010 Question 7</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_7&amp;diff=3319"/>
		<updated>2010-10-13T19:01:37Z</updated>

		<summary type="html">&lt;p&gt;Spanke: /* Design Choices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How is it possible for systems to support millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
A thread is an independent task that executes in the same address space as other threads within a single process, sharing data with them. Threads require fewer system resources than concurrent cooperating processes and are much cheaper to start, which is why a single process may contain millions of them. The two major types of threads are kernel threads and user-mode threads. Kernel threads are usually considered heavier, and designs that rely on them alone are not very scalable. User threads, on the other hand, are mapped onto kernel threads by a threads library such as libpthreads. A few designs take this approach, mainly Fibers and UMS (User-Mode Scheduling), which allow for very high scalability. UMS threads have their own context and resources, and the ability to switch between them in user mode makes them more efficient (depending on the application) than thread pools, which are yet another mechanism that allows for high scalability.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Taken the liberty to add Praubic&#039;s tentative first para. No changes made as of yet.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
One of the challenges in making an existing code base scalable is identifying and eliminating bottlenecks. While porting Linux to a 64-core NUMA system, Ray Bryant and John Hawkes found (and documented) the following bottlenecks:&lt;br /&gt;
&lt;br /&gt;
There can be instances of badly placed data in the cache that cause false cache-line sharing: logically unrelated items sharing a cache line force an expensive &amp;quot;cache-coherency operation&amp;quot; whenever different processors write to them. Once the data causing the problem is identified, it can be moved to limit the problem.&lt;br /&gt;
&lt;br /&gt;
Some frequently taken locks also contribute to bottlenecks. One such lock is the xtime_lock in Linux: readers holding the lock prevented writes to the timer value, leading to writer starvation. This problem was solved by using a lockless read.&lt;br /&gt;
&lt;br /&gt;
The multiqueue scheduler was the third major bottleneck, altogether eating up 25% of CPU time. It had two problems: its spinlock accounted for the majority of that time, while the rest went into computing and recomputing cache-related scheduling information. These problems were fixed by replacing it with a more efficient scheduler, the O(1) scheduler.&lt;br /&gt;
&lt;br /&gt;
The next two bottlenecks are related: both are examples of coarse-grained locks eating CPU time. Granularity refers to the execution time of a locked code segment; the closer a segment is to the cost of an atomic action, the finer its granularity.&lt;br /&gt;
&lt;br /&gt;
One big coarse-grained bottleneck in the system was the &amp;quot;Big Kernel Lock&amp;quot; (BKL), Linux&#039;s original kernel-wide synchronization lock. Waiting for the BKL took up as much as 70% of CPU time on a system with only 28 cores. The preferred approach on Linux NUMA systems was to limit the BKL&#039;s usage: the ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of this bottleneck.&lt;br /&gt;
&lt;br /&gt;
The last coarse-grained bottleneck was the dcache_lock. It consumed a noticeable amount of time in normal use, but it was also taken by the much more frequently called dnotify_parent() function, which made its cost unacceptable. The dcache_lock strategy was therefore replaced with a finer-grained strategy from a later Linux implementation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Design Choices ==&lt;br /&gt;
&#039;&#039;&#039;(A) Kernel Threads and User Threads (1:1 vs M:N)&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
This is the most basic design choice. The 1:1 model boasts a slim, clean library interface on top of the kernel&#039;s thread functions. The M:N model would require a far more complicated library, though it would offer advantages in areas such as signal handling. The general consensus was that the M:N design was not a good fit for the Linux kernel because of its high implementation cost, and this gave birth to the 1:1 model.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(B)Signal Handling&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The kernel implements POSIX signal handling and honours each thread&#039;s signal mask. Since a signal is delivered only to a thread that has it unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better position to judge which thread is best suited to receive the signal. This only holds true if the 1:1 model is used.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(C)Synchronization&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
The implementation of synchronization primitives such as mutexes, read-write locks, condition variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option, since threads can have different priorities (besides wasting CPU cycles). The same argument rules out the exclusive use of sched_yield. Signals were the only viable solution for the old implementation: threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability, caused by spurious wakeups and degradation of the quality of signal handling in the application. Fortunately, new functionality (the futex system call) has since been added to the kernel, on which all of these kinds of synchronization can be built.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(D)Memory Management&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
One of the goals for the library is a low startup cost for threads, so that the implementation scales. The biggest time cost outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is addressed by optimizing how that memory is allocated.&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;Working on this section&amp;gt; [[hirving]]&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;&amp;lt;br&amp;gt;(E)Scheduling Priorities&amp;lt;br&amp;gt;&#039;&#039;&#039;&lt;br /&gt;
A thread is an entity that is scheduled according to its scheduling priority, which is a number ranging from 0 to 31. Threads execute in time slices assigned to them in round-robin fashion, and lower-priority threads wait until the ones above them finish their tasks. A thread consists of its context, which breaks down into the set of machine registers and the kernel and user stacks, all linked to the address space of the process in which the thread resides. A context switch occurs when the time slice elapses and a thread of equal (or higher) priority becomes available; implemented efficiently, context switching is what permits high scalability. For example, fibers, which are switched entirely in user space, do not require a system call during a switch, which greatly increases efficiency. --[[User:Praubic|Praubic]] 18:24, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Spanke</name></author>
	</entry>
</feed>