Soma-notes - User contributions [en]

Talk:COMP 3000 Essay 2 2010 Question 3

2023-01-31T18:09:30Z

CFaibish: /* Group 3 Essay */

=Group 3 Essay=

Hello everyone, please post your contact information here:

Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]

Rey Arteaga: rarteaga@connect.carleton.ca

Corey Faibish

Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]

Fangchen Sun: [mailto:sfangche@connect.carleton.ca sfangche@connect.carleton.ca]

Mike Preston: [mailto:michaelapreston@gmail.com michaelapreston@gmail.com]

Wesley L. Lawrence: [mailto:wlawrenc@connect.carleton.ca wlawrenc@connect.carleton.ca]

Can't access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it's not too bad. Rarteaga

==Question 3 Group==
*Abdul-Fatah Tawfic tafatah
*Arteaga Reynaldo rarteaga
*Faibish Corey cfaibish
*Lawrence Wesley wlawrenc
*Preston Mike mpreston
*Robson Benjamin brobson
*Sun Fangchen sfangche

==Who is working on what ?==
Just to keep track of who's doing what --[[User:Tafatah|Tafatah]] 01:37, 15 November 2010 (UTC)

Hey everyone, I have taken the liberty of trying to provide a good first start on our paper. I have provided many resources and filled in information for all of the sections. This is not complete, but it should make the rest of the work a lot easier. Please go through and add in pieces that I am missing (specifically in the Critique section) and then we can put this essay to bed. Also, please note that below I have included my notes on the paper so that if anyone feels they do not have time to read the paper, they can read my notes instead and still find additional materials to contribute with.
--[[User:Mike Preston|Mike Preston]] 18:22, 20 November 2010 (UTC)

Man, Mike: you did a nice job! I'm reading through it now very thorough :) Since you pretty much turned all of your bulleted points from the discussion page into that on the main page, what else needs to be done? Just expanding on each topic and sub-topic? Or are there untouched concepts/topics that we should be addressing?
Oh and question two: Should we turn the Q&A from the end of the video of the presentation into information for the ''Critique'' section?
--[[User:CFaibish|CFaibish]] 20:34, 22 November 2010 (UTC)

Mike, thnx for the great job! I basically finished the part of related work based on your draft.
--[[User:sfangchen|Fangchen Sun]] 17:40, 24 November 2010 (UTC)

No problem, And great additions.
In terms of what needs to be done, I do believe that adding some detail to the critique is where we really need some focus. Using the Q&A from the video is probably a great source of inspiration, maybe just take a look at the topics presented, see if additional material from other sources can be obtain and use those sources to address any pros or cons to this artical. Remember, the critique section can be agreeing or disagreeing with what is presented in the actual paper.
--[[User:Mike Preston|Mike Preston]] 15:12, 28 November 2010 (UTC)

I noticied we needed some work in the Critique section, so I listened to the Q&A session at the end of the FlexSC mp3 talk, and took some quick notes. There seems to be 3 good ones (of the 9) that I picked out. I'll summarize them and add to the Critique section, specifically questions 3, 6, and 7. If anyone else wants to have listen to a specific question, and maybe try to do some more 'critiquing' here is a list of what time the questions each take place, and a very general statement on what the question is about, and the very general answer:
 1 - 22:30 Q: Did the paper consider Upstream patches(?) A:No
 2 - 23:00 Q: Security issues with the pages A:Pages pre-processor, no issue
 3 - 24:10 Q: What about blocking calls (read/write)? A: Not handled
 4 - 25:50 Q: ? A: Not a problem? (Personally didn't understand question, don't believe it's important, but anyone whose willing should double check)
 5 - 28:00 Q: Compare pollution between user thread switching to user-kernel thread switching? A: No, only looked at and measured pollution when switching user-to-kernel.
 6 - 29:30 Q: Scheduling problems of what cores are 'system' core, and what cores are 'user' cores A: Very simple algorithm, but not tested when running multiple apps, would need to be "fine-tuned"
 7 - 31:00 Q: Situations where FlexSC is bad, when running less or equal threads to the number of cores, such as "Scientific programs", mostly in userspace where one thread has 100% CPU resource A: Agrees, FlexSC is not to be used for such situations
 8 - 33:00 Q: Problems with un-related threads demanding service, how does it scale? Issue with frequency of polling could cause sys calls to take time to preform A: (Would be answered offline)
 9 - 34:30 Q: Backwards compatability and robustness A: Only an issue with getTID (Thread ID), needed a small patch.

--[[User:Wlawrence|Wesley Lawrence]] 20:31, 29 November 2010 (UTC)

Wrote information in Critique for questions 3, 6 and 7 (Blocking Calls, Core Scheduling Issues, and When There Are Not More Threads Then Cores). If you feel any additions need to be made, please feel free to add them. Most importantly, I'm not sure how to cite these. All information as obtained from the mp3 of the presentation, could some one let me know how to go about citing this?

--[[User:Wlawrence|Wesley Lawrence]] 21:05, 29 November 2010 (UTC)

I'm going to run through the whole paper, and just make sure everything makes sense, and fill in the holes where needed. I'll also add my own thoughts along the way. Feel free to do the same.-Rarteaga

Added 3 sections to the critique, definitions for the remaining terms (thanks Corey for taking care of some of these) and did some editing. My plan is to add some more flesh to the FlexSC-Threads section.
I'll do that sometime before 3PM on Thursday. I'll also go over the paper at that time in case something needs some editing. --[[User:Tafatah|Tafatah]] 06:38, 2 December 2010 (UTC)

I'm going to be working on the contributions section under Implementation and demonstrating some statistics they showed in the paper. Rarteaga

==Paper Summary==
I am not sure if everyone has taken the time to examine the paper closely, so I thought I would provide my notes on the paper so that anyone who has not read it could have a view of the high points.

Abstract:
- System calls are the accepted way to request services from the OS kernel, historical implementation.
- System calls almost always synchronous
- Aim to demonstrate how synchronous system calls negatively affect performance due mainly to pipeline flushing and pollution of key processor structures (TLB, data and instruction caches, etc.)
o TLB is translation lookaside buffer which is uses pages (data and code pages) to speed up virtual translation speed.
- Propose exception-less system calls to improve the current system call process.
o Improve processor efficiency via enabling flexible scheduling of OS work which in turn reduces size of execution both in kernel and user space thus reducing pollution effects on processor structures.
- Exception-less system calls especially effective on multi-core systems running multi-threaded applications.
- FlexSC is an implementation of exception-less system calls in the Linux kernel with accompanying user-mode threads from FlexSC-Threads package.
o Flex-SC-Threads convert legacy system calls into exception-less system calls.
Introduction:
- Synchronous system calls have a negative impact on system performance due to:
o Direct costs – mode switching
o Indirect costs – pollution of important processor structures
- Traditional system calls:
o Involve writing arguments to appropriate registers as well as issuing a special machine instruction which raises a synchronous exception.
o A processor exception is used to communicate with the kernel.
o Synchronous execution is enforced as the application expects the completion of the system call before user-mode execution resumes.
- Moore’s Law has provided large increases to performance potential of software while at the same time widening the gap between the performance of efficient and inefficient software.
o This gap is mainly caused by disparity of accessing different processor resources (registers, caches, memory)
- Server and system-intensive workloads are known to perform well below processor potential throughput.
o These are the items the researchers are mostly interested in.
o The cause is often described as due to the lack of locality.
o The researchers state this lack of locality is in part a result of the current synchronous system calls.
- When a synchronous system call, like pwrite, is called, the instruction per cycle level drops significantly and it takes many (in the example 14,000) cycles of execution for the instruction per cycle rate
to return to the level it was at before the system (pwrite) call.
- Exception-less System Call:
o Request for kernel services that does not require the use of synchronous processor exceptions.
o System calls are written to a reserved syscall page.
o Execution of system calls is performed asynchronously by special kernel level syscall threads. The result of the execution is stored on the syscall page after execution.
- By separating system call execution from system call invocation, the system can now have flexible system call scheduling.
o This allows system calls to be executed in batches, increasing the temporal locality of execution.
o Also provides a way to execute system calls on a separate core, in parallel to user-mode thread execution. This provides spatial per-core locality.
o An additional side effect is that now a multi-core system can have individual cores designated to run either user-mode or kernel mode execution dynamically depending on the current system load.
- In order to implement the exception-less system calls, the research team suggests adding a new M-on-N threading package.
o M user-mode threads executing on N kernel-visible threads.
o This would allow the threading package to harvest independent system calls, by switching threads, in user-mode, whenever a thread invokes a system call.
The (Real) Cost of System Calls
- Traditional way to measure the performance cost of system calls is the mode switch time. This is the time necessary to execute the system call instruction in user-mode, resume execution in kernel mode and
then return execution back to the user-mode.
- Mode switch in modern processors is a processor exception.
o Flush the user-mode pipeline, save registers onto the kernel stack, change the protection domain and redirect execution to the proper exception handler.
- Another measure of the performance of a system call is the state pollution caused by the system call.
o State pollution is the measure of how much user-mode data is overwritten in places like the TLB, cache (L1, L2, L3), branch prediction tables with kernel leel execution instructions for the system call.
o This data must be re-populated upon the return to user-mode.
- Potentially the most significant measure of cost of system calls is the performance impact on a running application.
o Ideally, user-mode instructions per cycle should not decrease as a result of a system call.
o Synchronous system calls do cause a drop in user-mode IPC due to; direct overhead - the processor exception associated with the system call which flushes the processor pipeline; and indirect overhead
– system call pollution on processors structures.
Exception-less System calls:
- System call batching
o By delaying a series of system calls and executing them in batches you can minimize the frequency of mode switches between user and kernel mode.
o Improves both the direct and indirect cost of system calls.
- Core specialization
o A system call can be scheduled on a different core then the core on which it was invoked, only for exception-less system calls.
o Provides ability to designate a core to run all system calls.
- Exception-less Syscall Interface
o Set of memory pages shared between user and kernel modes. Referred to as Syscall pages.
o User-space threads find a free entry in a syscall page and place a request for a system call. The user-space thread can then continue executing without interruption and must then return to the syscall
page to get the return value from the system call.
o Neither issuing the system call (via the syscall page) nor getting the return value generate an exception.
- Syscall pages
o Each page is a table of syscall entries.
o Each syscall entre has a state:
 Free – means a syscall can be added her
 Submitted – means the kernel can proceed to invoke the appropriate system call operations.
 Done – means the kernel is finished and has provided the return value to the syscall entry. User space thread must return and get the return value from the page.
- Decoupling Execution from Invocation
o To separate these two concepts a special kernel thread, syscall thread, is used.
o Sole purpose is to pull requests from syscall pages and execute them always in kernel mode.
o Syscall threads provide the ability to schedule the system calls on specific cores.
System Calls Galore – FlexSC-Threads
- Programming for exception-less system calls requires a different and more complex way of interacting with the kernel for OS functionality.
o The researchers describe working with exception-less system calls as being similar to event-driven programming in that you do not get the same sequential execution of code as you do with synchronous
system calls.
o In event-driven servers, the researchers suggest using a hybrid of both exception-less system calls (for performance critical paths) and regular synchronous system calls (for less critical system calls).
FlexSC-Threads
- Threading package which transforms synchronous system calls into exception-less system calls.
- Intended use is with server-type applications with which have many user-mode threads (like Apache or MySQL).
- Compatible with both POSIX threads and the default Linux thread library.
o As a result, multi-threaded Linux programs are immediately compatible with FlexSC threads without modification.
- For multi-core systems, a single kernel level thread is created for each core of the system. Multiple user-mode threads are multiplexed onto each kernel level thread via interactions with the syscall pages.
o The syscall pages are private to each kernel level thread, this means each core of a system has a syscall page from which it will receive system calls.
Overhead:
- When running a single exception-less system call against a single synchronous system call, the exception-less call was slower.
- When running a batch of exception-less system calls compared to a bunch of synchronous system calls, the exception-less system calls were much faster.
- The same is true for a remote server situation, one synchronous call is much faster than one exception-less system call but a batch of exception-less system calls is faster than the same number
of synchronous system calls.
Related Work:
- System Call Batching
o Operating systems have a concept called multi-calls which involves collecting multiple system calls and submitting them as a single system call.
o The Cassyopia compiler has an additional process called a looped multi-call where the result of one system call can be fed as an argument to another system call in the same multi-call.
o Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls do.
 Multi-call system calls are executed sequentially, each one must complete before the next may start.
- Locality of Execution and Multicores
o Other techniques include Soft Timers and Lazy Receiver Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to
limit processor interference associated with interrupt handling without affecting the latency of servicing requests.
o Computation Spreading is another locality process which is similar to FlexSC.
 Processor modifications that allow hardware migration of threads and migration to specialized cores.
 Did not model TLBs and on current hardware synchronous thread migration is a costly interprocessor interrupt.
o Also have proposals for dedicating CPU cores to specific operating system functionality.
 These solutions require a microkernel system.
 Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically.
- Non-blocking Execution
o Past research on improving system call performance has focused on blocking versus non-blocking behaviour.
 Typically researchers used threading, event-based (non-blocking) and hybrid systems to obtain high performance on server applications.
o Main difference between past research and FlexSC is that none of the past proposals have decoupled system call execution from system call invocation.
--[[User:Mike Preston|Mike Preston]] 04:03, 20 November 2010 (UTC)

COMP 3000 Essay 2 2010 Question 3

2010-12-02T03:37:01Z

CFaibish: /* Linux Application Binary Interface (ABI) */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

I did some of these, some I don't think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)

===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===
As per paper, locality refers to both types of locality, i.e. temporal and spatial, defined above. Thus, lack of locality here means data and instructions needed most frequently by the application continues to be switched back and forth (from registers and caches) due to system calls, attributing hence, to performance degradation.

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1] 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by disparity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===IO ===
FlexSC is not suited for data intensive, IO centric applications, as realized by Vijay Vasudevan [16]. Vijay's research aims to reduce the energy footprint in data centers. FlexSC 
was considered. It was found that FlexSC's reduction of mode switches, via the use of shared memory pages between user space and kernel space is useful for reducing the impact 
of system calls. That technique however was not useful for IO intensive work since it did not remove the requirement of data copying and did not reduce the overheads associated 
with interrupts in IO intensive tasks. 

===Some Kernel Changes Are Required===
Though most of the work is done transparently. i.e. there is no need for application's code modification, there remains a need for small kernel change (3 lines of code), as per section 3.2 of the paper [1]. 
That means adopters, and after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel. 

===Multicore Systems ===
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 
Further, the paper mentions that a predefined, static list of cores is used for system call threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. On a related note, scalability with increased cores is ambiguous. It is not that clear how scalable the 
scheduler is. One gets the impression that it is very scalable due to the fact that each core spawns a system call thread. Thus, as many threads as there are cores could be running 
concurrently, for one or more processes [1]. More explicit results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis 
of the results. Understandable, however, it would be nice to know if these threads (2 per core) would actually be treated as a core when turned on ? I.e. would the scheduler then realize 
that it can use eight cores ? Does that also mean the predefined static cores list would need to be modified, to list eight instead of four ? 
 
Along the same reasoning, and given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
outcome when using specialized GPUs, like NVIDIA's Tesla GPUs for example. Would FlexSC's scheduler be able to take advantage of the additional cores, and hence use them for 
specialized purposes ?

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T01:26:11Z

CFaibish: /* Exception-Less System Calls */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

I did some of these, some I don't think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)

===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1] 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by disparity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===IO ===
FlexSC is not suited for data intensive, IO centric applications, as realized by Vijay Vasudevan [16]. Vijay's research aims to reduce the energy footprint in data centers. FlexSC 
was considered. It was found that FlexSC's reduction of mode switches, via the use of shared memory pages between user space and kernel space is useful for reducing the impact 
of system calls. That technique however was not useful for IO intensive work since it did not remove the requirement of data copying and did not reduce the overheads associated 
with interrupts in IO intensive tasks. 

===Some Kernel Changes Are Required===
Though most of the work is done transparently. i.e. there is no need for application's code modification, there remains a need for small kernel change (3 lines of code), as per section 3.2 of the paper [1]. 
That means adopters, and after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel. 

===Multicore Systems ===
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 
Further, the paper mentions that a predefined, static list of cores is used for system call threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. On a related note, scalability with increased cores is ambiguous. It is not that clear how scalable the 
scheduler is. One gets the impression that it is very scalable due to the fact that each core spawns a system call thread. Thus, as many threads as there are cores could be running 
concurrently, for one or more processes [1]. More explicit results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis 
of the results. Understandable, however, it would be nice to know if these threads (2 per core) would actually be treated as a core when turned on ? I.e. would the scheduler then realize 
that it can use eight cores ? Does that also mean the predefined static cores list would need to be modified, to list eight instead of four ? 
 
Along the same reasoning, and given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
outcome when using specialized GPUs, like NVIDIA's Tesla GPUs for example. Would FlexSC's scheduler be able to take advantage of the additional cores, and hence use them for 
specialized purposes ?

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T01:24:17Z

CFaibish: /* IO */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

I did some of these, some I don't think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)

===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by disparity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===IO ===
FlexSC is not suited for data intensive, IO centric applications, as realized by Vijay Vasudevan [16]. Vijay's research aims to reduce the energy footprint in data centers. FlexSC 
was considered. It was found that FlexSC's reduction of mode switches, via the use of shared memory pages between user space and kernel space is useful for reducing the impact 
of system calls. That technique however was not useful for IO intensive work since it did not remove the requirement of data copying and did not reduce the overheads associated 
with interrupts in IO intensive tasks. 

===Some Kernel Changes Are Required===
Though most of the work is done transparently. i.e. there is no need for application's code modification, there remains a need for small kernel change (3 lines of code), as per section 3.2 of the paper [1]. 
That means adopters, and after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel. 

===Multicore Systems ===
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 
Further, the paper mentions that a predefined, static list of cores is used for system call threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. On a related note, scalability with increased cores is ambiguous. It is not that clear how scalable the 
scheduler is. One gets the impression that it is very scalable due to the fact that each core spawns a system call thread. Thus, as many threads as there are cores could be running 
concurrently, for one or more processes [1]. More explicit results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis 
of the results. Understandable, however, it would be nice to know if these threads (2 per core) would actually be treated as a core when turned on ? I.e. would the scheduler then realize 
that it can use eight cores ? Does that also mean the predefined static cores list would need to be modified, to list eight instead of four ? 
 
Along the same reasoning, and given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
outcome when using specialized GPUs, like NVIDIA's Tesla GPUs for example. Would FlexSC's scheduler be able to take advantage of the additional cores, and hence use them for 
specialized purposes ?

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:31:04Z

CFaibish: /* Instructions Per Cycle (IPC) */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

I did some of these, some I don't think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)

===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:29:02Z

CFaibish: /* Latency */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:23:10Z

CFaibish: /* References: */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:17:33Z

CFaibish: /* NPTL */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===Native POSIX Thread Library (NPTL)===
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:17:26Z

CFaibish: /* References: */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===NPTL ===

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:11:27Z

CFaibish: /* Linux ABI */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux Application Binary Interface (ABI)===
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]

===NPTL ===

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:11:16Z

CFaibish: /* References: */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux ABI ===

===NPTL ===

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:03:40Z

CFaibish: /* TLB */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===Translation Look-Aside Buffer (TLB)===
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux ABI ===

===NPTL ===

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

COMP 3000 Essay 2 2010 Question 3

2010-12-02T00:03:26Z

CFaibish: /* References: */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

will add the following terms. 
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)
===TLB ===

===Lack of Locality ===

===Throughput ===
Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

===Regular Store Instructions ===

===Linux ABI ===

===NPTL ===

===Syscall Page ===

===Syscall Threads ===

===Inter-Process Interrupt ===

===Latency ===

===Producer-Consumer Problem ===
Note: explain the relationship

TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

===Not Done Yet===
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC) 
 START 
Good 
Most of the work is done transparently. i.e. there is no need for application 
code modification. 

Not for IO intensive tasks 

Small kernel change (3 lines of code as per section 3.2). That means adopters, and 
after each update of the kernel, would have to add/modify the referenced lines and 
then recompile the kernel 

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic 
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism. 

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time, 
is it generated initially, or does the installer have to do any manual work. 
 
Scaling with increased cores is ambiguous 
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core 
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit 
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. 
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would 
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ? 
 
No testing with GPU 
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance 
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example. 
 END 

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]

COMP 3000 Essay 2 2010 Question 3

2010-11-30T01:25:01Z

CFaibish: /* When There Are Not More Threads Then Cores */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode swtich which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation lookaside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpetedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendancy for the same set of data to be accessed repeatedly over a brief time period. There are two imprtant forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

COMP 3000 Essay 2 2010 Question 3

2010-11-30T01:24:02Z

CFaibish: /* Core Scheduling Issues */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode swtich which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation lookaside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpetedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendancy for the same set of data to be accessed repeatedly over a brief time period. There are two imprtant forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains.

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

COMP 3000 Essay 2 2010 Question 3

2010-11-30T01:08:52Z

CFaibish: /* Blocking Calls */

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

== Paper ==
The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on specifics of the essay.
== Background Concepts: ==

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

===System Call===
A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4] 

===Mode Switch===
Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1] 

===Synchronous System Call===
Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2] 

===Asynchronous System Call===
An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3] 

===System Call Pollution===
System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode swtich which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation lookaside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3] 

===Processor Exceptions===
Processor exceptions are situations which cause the processor to stop current execution unexpetedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5] 

===System Call Batching===
System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6] 

===Temporal and Spatial Locality===
Locality is the concept that during execution there will be a tendancy for the same set of data to be accessed repeatedly over a brief time period. There are two imprtant forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8] 

===Instructions Per Cycle (IPC)===
Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9] 

== Research Problem: ==
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1] 

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1] 

== Contribution: ==

===Exception-Less System Calls===
Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through: 
1. System Call Batching: 
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them. 
2. Core Specialization 
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1] 
3. Exception-less System Call Interface 
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1] 
4. Decoupling Execution from Invocation 
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1] 

===FlexSC Threads===
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3] 

== Critique: ==

===Moore's Law===
One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance. 

===Performance of FlexSC===
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient. 

===Blocking Calls===
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation. 

===Core Scheduling Issues===
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so one. This seemed to be the most efficient algorithm for FlexSC scheduling, but, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be "fined tuned" for running multiple applications. 

===When There Are Not More Threads Then Cores===
In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains.

== Related Work: ==

===System Call Batching===

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately. 

===Locality of Execution and Multicores===

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism. 

===Non-blocking Execution===

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution. 

== References: ==
[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&rep=rep1&type=pdf PDF]

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&rep=rep1&type=pdf PDF]

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&rep=rep1&type=pdf PDF]

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&rep=rep1&type=pdf PDF]

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&rep=rep1&type=pdf PDF]

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

Talk:COMP 3000 Essay 2 2010 Question 3

2010-11-22T20:34:00Z

CFaibish: /* Who is working on what ? */

=Group 3 Essay=

Hello everyone, please post your contact information here:

Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]

Rey Arteaga: rarteaga@connect.carleton.ca

Corey Faibish: [mailto:corey.faibish@gmail.com corey.faibish@gmail.com]

Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]

Fangchen Sun: [mailto:sfangche@connect.carleton.ca sfangche@connect.carleton.ca]

Mike Preston: [mailto:michaelapreston@gmail.com michaelapreston@gmail.com]

Wesley L. Lawrence: [mailto:wlawrenc@connect.carleton.ca wlawrenc@connect.carleton.ca]

Can't access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it's not too bad. Rarteaga

==Question 3 Group==
*Abdul-Fatah Tawfic tafatah
*Arteaga Reynaldo rarteaga
*Faibish Corey cfaibish
*Lawrence Wesley wlawrenc
*Preston Mike mpreston
*Robson Benjamin brobson
*Sun Fangchen sfangche

==Who is working on what ?==
Just to keep track of who's doing what --[[User:Tafatah|Tafatah]] 01:37, 15 November 2010 (UTC)

Hey everyone, I have taken the liberty of trying to provide a good first start on our paper. I have provided many resources and filled in information for all of the sections. This is not complete, but it should make the rest of the work a lot easier. Please go through and add in pieces that I am missing (specifically in the Critique section) and then we can put this essay to bed. Also, please note that below I have included my notes on the paper so that if anyone feels they do not have time to read the paper, they can read my notes instead and still find additional materials to contribute with.
--[[User:Mike Preston|Mike Preston]] 18:22, 20 November 2010 (UTC)

Man, Mike: you did a nice job! I'm reading through it now very thorough :) Since you pretty much turned all of your bulleted points from the discussion page into that on the main page, what else needs to be done? Just expanding on each topic and sub-topic? Or are there untouched concepts/topics that we should be addressing?
Oh and question two: Should we turn the Q&A from the end of the video of the presentation into information for the ''Critique'' section?
--[[User:CFaibish|CFaibish]] 20:34, 22 November 2010 (UTC)

==Paper Summary==
I am not sure if everyone has taken the time to examine the paper closely, so I thought I would provide my notes on the paper so that anyone who has not read it could have a view of the high points.

Abstract:
- System calls are the accepted way to request services from the OS kernel, historical implementation.
- System calls almost always synchronous
- Aim to demonstrate how synchronous system calls negatively affect performance due mainly to pipeline flushing and pollution of key processor structures (TLB, data and instruction caches, etc.)
o TLB is translation lookaside buffer which is uses pages (data and code pages) to speed up virtual translation speed.
- Propose exception-less system calls to improve the current system call process.
o Improve processor efficiency via enabling flexible scheduling of OS work which in turn reduces size of execution both in kernel and user space thus reducing pollution effects on processor structures.
- Exception-less system calls especially effective on multi-core systems running multi-threaded applications.
- FlexSC is an implementation of exception-less system calls in the Linux kernel with accompanying user-mode threads from FlexSC-Threads package.
o Flex-SC-Threads convert legacy system calls into exception-less system calls.
Introduction:
- Synchronous system calls have a negative impact on system performance due to:
o Direct costs – mode switching
o Indirect costs – pollution of important processor structures
- Traditional system calls:
o Involve writing arguments to appropriate registers as well as issuing a special machine instruction which raises a synchronous exception.
o A processor exception is used to communicate with the kernel.
o Synchronous execution is enforced as the application expects the completion of the system call before user-mode execution resumes.
- Moore’s Law has provided large increases to performance potential of software while at the same time widening the gap between the performance of efficient and inefficient software.
o This gap is mainly caused by disparity of accessing different processor resources (registers, caches, memory)
- Server and system-intensive workloads are known to perform well below processor potential throughput.
o These are the items the researchers are mostly interested in.
o The cause is often described as due to the lack of locality.
o The researchers state this lack of locality is in part a result of the current synchronous system calls.
- When a synchronous system call, like pwrite, is called, the instruction per cycle level drops significantly and it takes many (in the example 14,000) cycles of execution for the instruction per cycle rate
to return to the level it was at before the system (pwrite) call.
- Exception-less System Call:
o Request for kernel services that does not require the use of synchronous processor exceptions.
o System calls are written to a reserved syscall page.
o Execution of system calls is performed asynchronously by special kernel level syscall threads. The result of the execution is stored on the syscall page after execution.
- By separating system call execution from system call invocation, the system can now have flexible system call scheduling.
o This allows system calls to be executed in batches, increasing the temporal locality of execution.
o Also provides a way to execute system calls on a separate core, in parallel to user-mode thread execution. This provides spatial per-core locality.
o An additional side effect is that now a multi-core system can have individual cores designated to run either user-mode or kernel mode execution dynamically depending on the current system load.
- In order to implement the exception-less system calls, the research team suggests adding a new M-on-N threading package.
o M user-mode threads executing on N kernel-visible threads.
o This would allow the threading package to harvest independent system calls, by switching threads, in user-mode, whenever a thread invokes a system call.
The (Real) Cost of System Calls
- Traditional way to measure the performance cost of system calls is the mode switch time. This is the time necessary to execute the system call instruction in user-mode, resume execution in kernel mode and
then return execution back to the user-mode.
- Mode switch in modern processors is a processor exception.
o Flush the user-mode pipeline, save registers onto the kernel stack, change the protection domain and redirect execution to the proper exception handler.
- Another measure of the performance of a system call is the state pollution caused by the system call.
o State pollution is the measure of how much user-mode data is overwritten in places like the TLB, cache (L1, L2, L3), branch prediction tables with kernel leel execution instructions for the system call.
o This data must be re-populated upon the return to user-mode.
- Potentially the most significant measure of cost of system calls is the performance impact on a running application.
o Ideally, user-mode instructions per cycle should not decrease as a result of a system call.
o Synchronous system calls do cause a drop in user-mode IPC due to; direct overhead - the processor exception associated with the system call which flushes the processor pipeline; and indirect overhead
– system call pollution on processors structures.
Exception-less System calls:
- System call batching
o By delaying a series of system calls and executing them in batches you can minimize the frequency of mode switches between user and kernel mode.
o Improves both the direct and indirect cost of system calls.
- Core specialization
o A system call can be scheduled on a different core then the core on which it was invoked, only for exception-less system calls.
o Provides ability to designate a core to run all system calls.
- Exception-less Syscall Interface
o Set of memory pages shared between user and kernel modes. Referred to as Syscall pages.
o User-space threads find a free entry in a syscall page and place a request for a system call. The user-space thread can then continue executing without interruption and must then return to the syscall
page to get the return value from the system call.
o Neither issuing the system call (via the syscall page) nor getting the return value generate an exception.
- Syscall pages
o Each page is a table of syscall entries.
o Each syscall entre has a state:
 Free – means a syscall can be added her
 Submitted – means the kernel can proceed to invoke the appropriate system call operations.
 Done – means the kernel is finished and has provided the return value to the syscall entry. User space thread must return and get the return value from the page.
- Decoupling Execution from Invocation
o To separate these two concepts a special kernel thread, syscall thread, is used.
o Sole purpose is to pull requests from syscall pages and execute them always in kernel mode.
o Syscall threads provide the ability to schedule the system calls on specific cores.
System Calls Galore – FlexSC-Threads
- Programming for exception-less system calls requires a different and more complex way of interacting with the kernel for OS functionality.
o The researchers describe working with exception-less system calls as being similar to event-driven programming in that you do not get the same sequential execution of code as you do with synchronous
system calls.
o In event-driven servers, the researchers suggest using a hybrid of both exception-less system calls (for performance critical paths) and regular synchronous system calls (for less critical system calls).
FlexSC-Threads
- Threading package which transforms synchronous system calls into exception-less system calls.
- Intended use is with server-type applications with which have many user-mode threads (like Apache or MySQL).
- Compatible with both POSIX threads and the default Linux thread library.
o As a result, multi-threaded Linux programs are immediately compatible with FlexSC threads without modification.
- For multi-core systems, a single kernel level thread is created for each core of the system. Multiple user-mode threads are multiplexed onto each kernel level thread via interactions with the syscall pages.
o The syscall pages are private to each kernel level thread, this means each core of a system has a syscall page from which it will receive system calls.
Overhead:
- When running a single exception-less system call against a single synchronous system call, the exception-less call was slower.
- When running a batch of exception-less system calls compared to a bunch of synchronous system calls, the exception-less system calls were much faster.
- The same is true for a remote server situation, one synchronous call is much faster than one exception-less system call but a batch of exception-less system calls is faster than the same number
of synchronous system calls.
Related Work:
- System Call Batching
o Operating systems have a concept called multi-calls which involves collecting multiple system calls and submitting them as a single system call.
o The Cassyopia compiler has an additional process called a looped multi-call where the result of one system call can be fed as an argument to another system call in the same multi-call.
o Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls do.
 Multi-call system calls are executed sequentially, each one must complete before the next may start.
- Locality of Execution and Multicores
o Other techniques include Soft Timers and Lazy Receiver Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to
limit processor interference associated with interrupt handling without affecting the latency of servicing requests.
o Computation Spreading is another locality process which is similar to FlexSC.
 Processor modifications that allow hardware migration of threads and migration to specialized cores.
 Did not model TLBs and on current hardware synchronous thread migration is a costly interprocessor interrupt.
o Also have proposals for dedicating CPU cores to specific operating system functionality.
 These solutions require a microkernel system.
 Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically.
- Non-blocking Execution
o Past research on improving system call performance has focused on blocking versus non-blocking behaviour.
 Typically researchers used threading, event-based (non-blocking) and hybrid systems to obtain high performance on server applications.
o Main difference between past research and FlexSC is that none of the past proposals have decoupled system call execution from system call invocation.
--[[User:Mike Preston|Mike Preston]] 04:03, 20 November 2010 (UTC)

COMP 3000 Essay 2 2010 Question 3

2010-11-22T20:25:20Z

CFaibish: /* Exception-Less System Calls */

COMP 3000 Essay 2 2010 Question 3

2010-11-22T20:20:20Z

CFaibish: /* Exception-Less System Calls */

COMP 3000 Essay 2 2010 Question 3

2010-11-22T20:11:08Z

CFaibish: /* Research Problem: */

Talk:COMP 3000 Essay 2 2010 Question 3

2010-11-15T19:04:23Z

CFaibish: /* Group 3 Essay */

=Group 3 Essay=

Hello everyone, please post your contact information here:

Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]

Rey Arteaga: rarteaga@connect.carleton.ca

Corey Faibish: [mailto:corey.faibish@gmail.com corey.faibish@gmail.com]

Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]

Fangchen Sun: [mailto:sfangche@connect.carleton.ca sfangche@connect.carleton.ca]

Mike Preston: [mailto:michaelapreston@gmail.com michaelapreston@gmail.com]

Can't access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it's not too bad. Rarteaga

==Question 3 Group==
*Abdul-Fatah Tawfic tafatah
*Arteaga Reynaldo rarteaga
*Faibish Corey cfaibish
*Lawrence Wesley wlawrenc
*Preston Mike mpreston
*Robson Benjamin brobson
*Sun Fangchen sfangche

==Who is working on what ?==
Just to keep track of who's doing what --[[User:Tafatah|Tafatah]] 01:37, 15 November 2010 (UTC)

Talk:COMP 3000 Essay 2 2010 Question 3

2010-11-15T01:14:03Z

CFaibish: /* Group 3 Essay */

=Group 3 Essay=

Hello everyone, please post your contact information here:

Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]

Rey Arteaga: rarteaga@connect.carleton.ca

Corey Faibish: corey.faibish@gmail.com

Can't access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it's not too bad. Rarteaga

==Question 3 Group==
*Abdul-Fatah Tawfic tafatah
*Arteaga Reynaldo rarteaga
*Faibish Corey cfaibish
*Lawrence Wesley wlawrenc
*Preston Mike mpreston
*Robson Benjamin brobson
*Sun Fangchen sfangche

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-14T19:00:07Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:
 I (Sschnei1) added some text to the Round-Robin Scheduling and the Multilevel Queue Scheduling.

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time. The Round-Robin Scheduling is used in Linux-1.2.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system. The O(1) algorithm in 2.6 up to 2.6.23 is based on a Multilevel Queue.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

-- [[User:Sschnei1|Sschnei1]] 00:24, 14 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes. This allows multiple processes to parallel on a single CPU.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need to do timeslicing on the CPU, and still provide most performance with as much CPU utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

The ULE has a constant execution time of O(1), regardless of the number of threads. In addition, it also is careful to identify interactive tasks and give them the lowest latency response possible. The scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a priority calculator, and a slice calculator.

Since ULE is an event driven scheduler, there is no periodic timer that adjusts thread priority to prevent starvation. Fairness is implemented by maintaining two queues: the current and the next queue. Each thread that is granted a CPU slice is assigned to either the current or next queue. Threads are then picked from the current queue in order of priority until it is empty, which is when the next and current queues are switched, and the process begins again. This guarantees that each thread will be given use of its slice once every two queue cycles, regardless of priority. Interrupt and real-time threads (and threads with these priority levels) are inserted onto the current queue, for they are of the highest priority. There is also an idle class of threads, which is checked only when there are no other runnable tasks.

In order to promptly discover when a thread changes from interactive to non-interactive, the ULE scheduler uses Interactivity Scoring, a key part of the scheduler affecting the responsiveness of the system, and subsequently, user experience. An interactivity score is computed from the relationship between run and sleep time, using a formula that is out of scope for this project. Interactive threads usually have high sleep times because they are often waiting for user input. This usually is followed by quick bursts in CPU activity from processing the user's request.

The ULE also implements a priority calculator, not to maintain fairness, but only to order the threads by priority. Only time-sharing threads use the priority calculator, the rest are allotted statically. ULE uses the interactivity score to determine the nice value of a thread, which will allow it to in turn decide upon its priority.

The way that the ULE uses its nice values in combination with slices is by implementing a moving window of nice values allowed slices. The threads within the window are given a slice oppositely proportional to the difference between their nice value and the lowest recorded nice value. This results in smaller slices to nicer threads, which subsequently defines their amount of allotted CPU time. On x86, FreeBSD allows for a minimum slice value of 10ms, and a maximum of 140ms. Interactive tasks receive the smallest nice value, in order to more promptly find that the interactive task is no longer interacting.

The ULE uses a CPU usage estimator to show roughly how much CPU a thread is using. It operates on an event driven basis. ULE keeps track of the number of clock ticks that occurred within a sliding window of the thread's execution time. The window slides to upwards of one second past threshold, and then back down to regulate the ratio of run time to sleep time.

The ULE enables SMP (Symmetric Multiprocessing) in order to greater achieve CPU affinity, which is when you schedule threads onto the last CPU they were run on, as to avoid unnecessary CPU migrations. Processors typically have large caches to aid the performance of threads and processes. CPU affinity is key because the thread may still have leftover data in the caches of the previous CPU, and when a thread migrates, it not only has to load this data into the new CPU's cache, but must also clear the data from the previous CPU's cache. ULE has two methods for CPU load balancing: pull and push. Pull is when an idle CPU grabs a thread from a non-idle CPU to lend a hand. Push is when a periodic task evaluates the current load situation of the CPUs and balances it out amongst them. These two function side by side, and allow for an optimally balanced CPU workload.

SMT (Symmetric Multithreading), a concept of non-uniform processors, is not fully present in ULE. The foundations are there, however, which can eventually be extended to support NUMA (Non-Uniform Memory Architecture). This involves expressing the penalties of CPU migration through separate queues, which could be extended to add a local and global load-balancing policy. As far as my sources go, FreeBSD does not at this point support NUMA, however the groundwork is there, and it is a real possibility for it to appear in a future version.

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 23:27, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

I'm reading through what's up there now looking for typos/rewording certain aspects [mostly typos tho, this is very well written]. Should I just go and edit it? Or should I compose a mirrored version and go from there?

--[[User:CFaibish|CFaibish]] 14:42, 14 October 2010 (UTC)

Since their just minimal fixes like commas, pluralization, and apostrophes, I'll just go ahead and edit them in-essay. I've saved a backup as well in case you guys feel I ruined anything.

--[[User:CFaibish|CFaibish]] 14:52, 14 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

COMP 3000 Essay 1 2010 Question 5

2010-10-14T15:20:59Z

CFaibish: /* Answer */

=Question=

Compare and contrast the evolution of the default BSD/FreeBSD and Linux schedulers.

=Answer=

(This work belongs to --[[User:Mike Preston|Mike Preston]] 23:01, 13 October 2010 (UTC))
(modified by --[[User:AbsMechanik|AbsMechanik]] 03:00, 14 October 2010 (UTC))

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with, all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system (Jensen: 1985).
There are several different algorithms which are utilized in different schedulers, but a few key algorithms are outlined below[http://joshaas.net/linux/linux_cpu_scheduler.pdf][http://www.sci.csueastbay.edu/~billard/cs4560/node6.html][http://www.articles.assyriancafe.com/documents/CPU_Scheduling.pdf]: <ul>

<li>First-Come, First-Serve (also known as FIFO): No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>

<li>Shortest Job First (similar to Shortest Remaining Time and/or Shortest Process Next): Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>

<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>

<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to First-Come, First-Serve, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>

<li>Multilevel Feedback Queue Scheduling: Rule-based multi-tasking. It is a combination of First-Come, First-Serve, Round-Robin & Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>

</ul>
There is no one "best" algorithm and most schedulers utilize a combination of the different algorithms, such as the Multi-Level Feedback Queue, which in one way or another was utilized in Win XP/Vista, Linux 2.5-2.6, FreeBSD, Mac OSX, NetBSD and Solaris. One thing for certain is that as computer hardware increases in complexity, such as multiple core CPUs (parallelization), and with the advent of more powerful embedded/mobile devices, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers:
<ul>
<li>'''The default BSD/FreeBSD scheduler'''</li>
<li>'''The Linux scheduler'''</li>
</ul>

==BSD/Free BSD Schedulers==

===Overview & History===

(This work belongs to --[[User:Mike Preston|Mike Preston]] 23:41, 13 October 2010 (UTC))
(modified by --[[User:AbsMechanik|AbsMechanik]] 13:21, 14 October 2010 (UTC))

The FreeBSD kernel originally inherited its scheduler from 4.3BSD which itself is a version of the UNIX scheduler [http://dspace.hil.unb.ca:8080/bitstream/handle/1882/100/roberson.pdf?sequence=1].
In order to understand the evolution of the FreeBSD scheduler it is important to understand the original purpose and limitations of the BSD scheduler. Like most traditional UNIX based systems, the BSD scheduler was designed to work on a single core computer system (with limited I/O) and handle relatively small numbers of processes. As a result, managing resources with an O(n) scheduler did not raise any performance issues. To ensure fairness, the scheduler would switch between processes every 0.1 second (100 milliseconds) in a round-robin format [http://www.thehackademy.net/madchat/ebooks/sched/FreeBSD/the_FreeBSD_process_scheduler.pdf].

As computer systems increased in complexity with the advent of multi-core CPUs and various new I/O devices, computer programs, naturally, increased in size and complexity to accommodate and manage the new hardware. With CPUs becoming more powerful (derived from Moore's Law [http://www.intel.com/technology/mooreslaw/]), the time taken to complete a process decreased significantly. This additional complexity highlighted the problem of having an O(n) scheduler for managing processes, as more items were added to the scheduling algorithm, the performance decreased. With symmetric multiprocessing (SMP) becoming inevitable (multi-core CPU's) a better scheduler was required. This was the driving force behind the creation of ULE for the FreeBSD.

===Older Versions===

(This work belongs to --[[User:Mike Preston|Mike Preston]] 00:02, 14 October 2010 (UTC))

The FreeBSD kernel originally used an enhanced version of the BSD scheduler. Specifically, the FreeBSD scheduler included classes of threads, which was a drastic change from the round-robin scheduling used in BSD. Initially, there were two types of thread class, real-time and idle [https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf], and the scheduler would give processor time to real-time threads first and the idle threads had to wait until there were no real-time threads that needed access to the processor.

To manage the various threads, FreeBSD had data structures called runqueues, into which the threads were placed. The scheduler would evaluate the runqueues based on priority from highest to lowest and execute the first thread of a non-empty runqueue it found. Once a non-empty runqueue was found, each thread in the runqueue would be assigned an equal value time slice of 0.1 seconds, a value that has not changed in over 20 years [http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156].

Unfortunately, like the BSD scheduler it was based on, the original FreeBSD scheduler was not built to handle Symmetric Multiprocessing (SMP) or Symmetric Multithreading (SMT) on multi-core systems. The scheduler was still limited by an O(n) algorithm, which could not efficiently handle the loads required on ever increasingly powerful systems.
To allow FreeBSD to operate with more modern computer systems, it became clear that a new scheduler would be required, and thus, became the driving force behind the implementation of ULE.

===Current Version===

(This work is owned by --[[User:Mike Preston|Mike Preston]] 00:23, 14 October 2010 (UTC))

In order to effectively manage multi-core computer systems, FreeBSD needed a scheduler with an algorithm that would execute in constant time regardless of the number of threads involved. The ULE scheduler was designed for this purpose. It is of interest to note that throughout the course of the BSD/FreeBSD scheduler evolution, each iteration has just been an improvement on existing scheduler technologies. Although each version was designed to provide support for some current reality of computing, like multi-core systems, the evolution was out of necessity and not due to a desire to re-evaluate how the current version accomplished its tasks.

The "slow" evolution of the FreeBSD scheduler becomes even more evident when comparing it to the Linux scheduler, which has evolved through a series of attempts to provide alternative ways to solve scheduling tasks. From dynamic time slices, to various data structure implementations, and even various ways of describing priority levels (see: "nice" levels), the Linux scheduler advancement has occurred through a series of drastic changes. In comparison, the FreeBSD scheduler has been changed only when the current version was no longer able to meet the needs of the existing computing climate.

==Linux Schedulers==

(Note to the other group members: Feel free to modify or remove anything I post here. I'm just trying to piece together what you've all posted in the discussion section and turn it into a single paragraph. You know. Just to see how it looks.)

-- [[User:abondio2|Austin Bondio]] Last edit: 22:17, 13 October 2010 (UTC)

(Same for me, I'm trying to put together the overview/history and work on the comparison section of the essay, all based off the history you guys give. If I miss anything or get anything wrong, feel free to correct.)

-- [[User:Wlawrenc|Wesley Lawrence]]

(Austin - I added a reference to one of your sections as the current reference only went to wikipedia which the prof has kind of implied is not a good idea, I also added another one that was to a blog post as that was another thing the prof mentioned was not the best idea. I am hoping this will provide additional alidations to the sources.)
--[[User:Mike Preston|Mike Preston]] 00:29, 14 October 2010 (UTC)

===Overview & History===

(This work belongs to [[User:Wlawrenc|Wesley Lawrence]])

The Linux scheduler has a large history of improvement, always aiming towards having a fair and fast scheduler. Various methods and concepts have been tried over different versions to get this fair and fast scheduler, including round robin, iterations, and queues. A quick read through of the history of Linux implies that firstly, equal and balanced use of the system was the goal of the scheduler, and once that was in place, speed was soon improved. Early schedulers did their best to give processes equal time and resources, but used a bit of extra time (in computer terms) to accomplish this. By Linux 2.6, after experimenting with different concepts, the scheduler was able to provide fair access and time, as well as run as quickly as possible, with various features to allow personal tweaking by the system user, or even the processes themselves.

(This work was done by [[User:abondio2|Austin Bondio]], modified by [[User:Sschnei1|Sschnei1]] )

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969 [http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html](Stallings: 2009). The early versions had relatively inefficient schedulers, which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

===Older Versions===

(This work belongs to [[User:Sschnei1|Sschnei1]])

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was the first scheduler that supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a scheduling event. The scheduler divided tasks into epochs, allowing each task to execute up to its time slice. If a task did not use up its entire time slice, the remaining time was added to the next time slice to allow the task to execute longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability, and did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware architectures, such as multi-core processors.

===Current Version===

(This work was done by [[User:Sschnei1|Sschnei1]])

As of the Linux 2.6.23 introduction, the CFS (Completely Fair Scheduler) took its place in the kernel. CFS uses the idea of maintaining fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor. When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual runtime.

The model of how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently. Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu are stored in the right side of the tree. To keep fairness, the scheduler takes the left-most node from the tree. The scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable, the task then is inserted into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side of the tree are migrated to the left side to maintain fairness.

(This work was done by [[User:abondio2|Austin Bondio]])

Under a recent Linux system (version 2.6.35 or later), scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature, which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.

==Tabulated Results==

(Once I read/see some history on the BSD section above, I'll do the best comparison I can. I'm balancing 3000/3004 and other courses (like most of you), so I don't think I can research/write BSD and write the comparison, but I will try to help out as much as I can)

-- [[User:Wlawrenc|Wesley Lawrence]]

I've got this. Hopefully most of the sections I created properly answer the question. I'm still going to go over everyone's answers and keep in mind that wikipedia cannot be cited as a resource. --[[User:AbsMechanik|AbsMechanik]] 02:29, 14 October 2010 (UTC)

==Current Challenges==

== Resources ==

1. Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

2. Stallings, William, Operating Systems: Internals and Design Principles, Pearson Prentice Hall, 2009.

3. McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-14T15:13:22Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:
 I (Sschnei1) added some text to the Round-Robin Scheduling and the Multilevel Queue Scheduling.

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time. The Round-Robin Scheduling is used in Linux-1.2.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system. The O(1) algorithm in 2.6 up to 2.6.23 is based on a Multilevel Queue.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

-- [[User:Sschnei1|Sschnei1]] 00:24, 14 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes. This allows multiple processes to parallel on a single CPU.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need to do timeslicing on the CPU, and still provide most performance with as much CPU utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

The ULE has a constant execution time of O(1), regardless of the number of threads. In addition, it also is careful to identify interactive tasks and give them the lowest latency response possible. The scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a priority calculator, and a slice calculator.

Since ULE is an event driven scheduler, there is no periodic timer that adjusts thread priority to prevent starvation. Fairness is implemented by maintaining two queues: the current and the next queue. Each thread that is granted a CPU slice is assigned to either the current or next queue. Threads are then picked from the current queue in order of priority until it is empty, which is when the next and current queues are switched, and the process begins again. This guarantees that each thread will be given use of its slice once every two queue cycles, regardless of priority. Interrupt and real-time threads (and threads with these priority levels) are inserted onto the current queue, for they are of the highest priority. There is also an idle class of threads, which is checked only when there are no other runnable tasks.

In order to promptly discover when a thread changes from interactive to non-interactive, the ULE scheduler uses Interactivity Scoring, a key part of the scheduler affecting the responsiveness of the system, and subsequently, user experience. An interactivity score is computed from the relationship between run and sleep time, using a formula that is out of scope for this project. Interactive threads usually have high sleep times because they are often waiting for user input. This usually is followed by quick bursts in CPU activity from processing the user's request.

The ULE also implements a priority calculator, not to maintain fairness, but only to order the threads by priority. Only time-sharing threads use the priority calculator, the rest are allotted statically. ULE uses the interactivity score to determine the nice value of a thread, which will allow it to in turn decide upon its priority.

The way that the ULE uses its nice values in combination with slices is by implementing a moving window of nice values allowed slices. The threads within the window are given a slice oppositely proportional to the difference between their nice value and the lowest recorded nice value. This results in smaller slices to nicer threads, which subsequently defines their amount of allotted CPU time. On x86, FreeBSD allows for a minimum slice value of 10ms, and a maximum of 140ms. Interactive tasks receive the smallest nice value, in order to more promptly find that the interactive task is no longer interacting.

The ULE uses a CPU usage estimator to show roughly how much CPU a thread is using. It operates on an event driven basis. ULE keeps track of the number of clock ticks that occurred within a sliding window of the thread's execution time. The window slides to upwards of one second past threshold, and then back down to regulate the ratio of run time to sleep time.

The ULE enables SMP (Symmetric Multiprocessing) in order to greater achieve CPU affinity, which is when you schedule threads onto the last CPU they were run on, as to avoid unnecessary CPU migrations. Processors typically have large caches to aid the performance of threads and processes. CPU affinity is key because the thread may still have leftover data in the caches of the previous CPU, and when a thread migrates, it not only has to load this data into the new CPU's cache, but must also clear the data from the previous CPU's cache. ULE has two methods for CPU load balancing: pull and push. Pull is when an idle CPU grabs a thread from a non-idle CPU to lend a hand. Push is when a periodic task evaluates the current load situation of the CPUs and balances it out amongst them. These two function side by side, and allow for an optimally balanced CPU workload.

SMT (Symmetric Multithreading), a concept of non-uniform processors, is not fully present in ULE. The foundations are there, however, which can eventually be extended to support NUMA (Non-Uniform Memory Architecture). This involves expressing the penalties of CPU migration through separate queues, which could be extended to add a local and global load-balancing policy. As far as my sources go, FreeBSD does not at this point support NUMA, however the groundwork is there, and it is a real possibility for it to appear in a future version.

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 23:27, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

I'm reading through what's up there now looking for typos/rewording certain aspects [mostly typos tho, this is very well written]. Should I just go and edit it? Or should I compose a mirrored version and go from there?

--[[User:CFaibish|CFaibish]] 14:42, 14 October 2010 (UTC)

Since their just minimal fixes like commas, pluralization, and apostrophes, I'll just go ahead and edit them in-essay. I've saved a backup as well in case you guys feel I ruined anything.

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-14T14:53:36Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:
 I (Sschnei1) added some text to the Round-Robin Scheduling and the Multilevel Queue Scheduling.

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time. The Round-Robin Scheduling is used in Linux-1.2.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system. The O(1) algorithm in 2.6 up to 2.6.23 is based on a Multilevel Queue.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

-- [[User:Sschnei1|Sschnei1]] 00:24, 14 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes. This allows multiple processes to parallel on a single CPU.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need to do timeslicing on the CPU, and still provide most performance with as much CPU utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

The ULE has a constant execution time of O(1), regardless of the number of threads. In addition, it also is careful to identify interactive tasks and give them the lowest latency response possible. The scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a priority calculator, and a slice calculator.

Since ULE is an event driven scheduler, there is no periodic timer that adjusts thread priority to prevent starvation. Fairness is implemented by maintaining two queues: the current and the next queue. Each thread that is granted a CPU slice is assigned to either the current or next queue. Threads are then picked from the current queue in order of priority until it is empty, which is when the next and current queues are switched, and the process begins again. This guarantees that each thread will be given use of its slice once every two queue cycles, regardless of priority. Interrupt and real-time threads (and threads with these priority levels) are inserted onto the current queue, for they are of the highest priority. There is also an idle class of threads, which is checked only when there are no other runnable tasks.

In order to promptly discover when a thread changes from interactive to non-interactive, the ULE scheduler uses Interactivity Scoring, a key part of the scheduler affecting the responsiveness of the system, and subsequently, user experience. An interactivity score is computed from the relationship between run and sleep time, using a formula that is out of scope for this project. Interactive threads usually have high sleep times because they are often waiting for user input. This usually is followed by quick bursts in CPU activity from processing the user's request.

The ULE also implements a priority calculator, not to maintain fairness, but only to order the threads by priority. Only time-sharing threads use the priority calculator, the rest are allotted statically. ULE uses the interactivity score to determine the nice value of a thread, which will allow it to in turn decide upon its priority.

The way that the ULE uses its nice values in combination with slices is by implementing a moving window of nice values allowed slices. The threads within the window are given a slice oppositely proportional to the difference between their nice value and the lowest recorded nice value. This results in smaller slices to nicer threads, which subsequently defines their amount of allotted CPU time. On x86, FreeBSD allows for a minimum slice value of 10ms, and a maximum of 140ms. Interactive tasks receive the smallest nice value, in order to more promptly find that the interactive task is no longer interacting.

The ULE uses a CPU usage estimator to show roughly how much CPU a thread is using. It operates on an event driven basis. ULE keeps track of the number of clock ticks that occurred within a sliding window of the thread's execution time. The window slides to upwards of one second past threshold, and then back down to regulate the ratio of run time to sleep time.

The ULE enables SMP (Symmetric Multiprocessing) in order to greater achieve CPU affinity, which is when you schedule threads onto the last CPU they were run on, as to avoid unnecessary CPU migrations. Processors typically have large caches to aid the performance of threads and processes. CPU affinity is key because the thread may still have leftover data in the caches of the previous CPU, and when a thread migrates, it not only has to load this data into the new CPU's cache, but must also clear the data from the previous CPU's cache. ULE has two methods for CPU load balancing: pull and push. Pull is when an idle CPU grabs a thread from a non-idle CPU to lend a hand. Push is when a periodic task evaluates the current load situation of the CPUs and balances it out amongst them. These two function side by side, and allow for an optimally balanced CPU workload.

SMT (Symmetric Multithreading), a concept of non-uniform processors, is not fully present in ULE. The foundations are there, however, which can eventually be extended to support NUMA (Non-Uniform Memory Architecture). This involves expressing the penalties of CPU migration through separate queues, which could be extended to add a local and global load-balancing policy. As far as my sources go, FreeBSD does not at this point support NUMA, however the groundwork is there, and it is a real possibility for it to appear in a future version.

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 23:27, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

I'm reading through what's up there now looking for typos/rewording certain aspects [mostly typos tho, this is very well written]. Should I just go and edit it? Or should I compose a mirrored version and go from there?

--[[User:CFaibish|CFaibish]] 14:42, 14 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-14T14:42:54Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:
 I (Sschnei1) added some text to the Round-Robin Scheduling and the Multilevel Queue Scheduling.

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time. The Round-Robin Scheduling is used in Linux-1.2.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system. The O(1) algorithm in 2.6 up to 2.6.23 is based on a Multilevel Queue.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

-- [[User:Sschnei1|Sschnei1]] 00:24, 14 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes. This allows multiple processes to parallel on a single CPU.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need to do timeslicing on the CPU, and still provide most performance with as much CPU utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

The ULE has a constant execution time of O(1), regardless of the number of threads. In addition, it also is careful to identify interactive tasks and give them the lowest latency response possible. The scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a priority calculator, and a slice calculator.

Since ULE is an event driven scheduler, there is no periodic timer that adjusts thread priority to prevent starvation. Fairness is implemented by maintaining two queues: the current and the next queue. Each thread that is granted a CPU slice is assigned to either the current or next queue. Threads are then picked from the current queue in order of priority until it is empty, which is when the next and current queues are switched, and the process begins again. This guarantees that each thread will be given use of its slice once every two queue cycles, regardless of priority. Interrupt and real-time threads (and threads with these priority levels) are inserted onto the current queue, for they are of the highest priority. There is also an idle class of threads, which is checked only when there are no other runnable tasks.

In order to promptly discover when a thread changes from interactive to non-interactive, the ULE scheduler uses Interactivity Scoring, a key part of the scheduler affecting the responsiveness of the system, and subsequently, user experience. An interactivity score is computed from the relationship between run and sleep time, using a formula that is out of scope for this project. Interactive threads usually have high sleep times because they are often waiting for user input. This usually is followed by quick bursts in CPU activity from processing the user's request.

The ULE also implements a priority calculator, not to maintain fairness, but only to order the threads by priority. Only time-sharing threads use the priority calculator, the rest are allotted statically. ULE uses the interactivity score to determine the nice value of a thread, which will allow it to in turn decide upon its priority.

The way that the ULE uses its nice values in combination with slices is by implementing a moving window of nice values allowed slices. The threads within the window are given a slice oppositely proportional to the difference between their nice value and the lowest recorded nice value. This results in smaller slices to nicer threads, which subsequently defines their amount of allotted CPU time. On x86, FreeBSD allows for a minimum slice value of 10ms, and a maximum of 140ms. Interactive tasks receive the smallest nice value, in order to more promptly find that the interactive task is no longer interacting.

The ULE uses a CPU usage estimator to show roughly how much CPU a thread is using. It operates on an event driven basis. ULE keeps track of the number of clock ticks that occurred within a sliding window of the thread's execution time. The window slides to upwards of one second past threshold, and then back down to regulate the ratio of run time to sleep time.

The ULE enables SMP (Symmetric Multiprocessing) in order to greater achieve CPU affinity, which is when you schedule threads onto the last CPU they were run on, as to avoid unnecessary CPU migrations. Processors typically have large caches to aid the performance of threads and processes. CPU affinity is key because the thread may still have leftover data in the caches of the previous CPU, and when a thread migrates, it not only has to load this data into the new CPU's cache, but must also clear the data from the previous CPU's cache. ULE has two methods for CPU load balancing: pull and push. Pull is when an idle CPU grabs a thread from a non-idle CPU to lend a hand. Push is when a periodic task evaluates the current load situation of the CPUs and balances it out amongst them. These two function side by side, and allow for an optimally balanced CPU workload.

SMT (Symmetric Multithreading), a concept of non-uniform processors, is not fully present in ULE. The foundations are there, however, which can eventually be extended to support NUMA (Non-Uniform Memory Architecture). This involves expressing the penalties of CPU migration through separate queues, which could be extended to add a local and global load-balancing policy. As far as my sources go, FreeBSD does not at this point support NUMA, however the groundwork is there, and it is a real possibility for it to appear in a future version.

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 23:27, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

I'm reading through what's up there now looking for typos/rewording certain aspects. Should I just go and edit it? Or should I compose a mirrored version and go from there?
--[[User:CFaibish|CFaibish]] 14:42, 14 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

COMP 3000 Essay 1 2010 Question 5

2010-10-14T14:39:31Z

CFaibish: /* Resources */

=Question=

Compare and contrast the evolution of the default BSD/FreeBSD and Linux schedulers.

=Answer=

(This work belongs to --[[User:Mike Preston|Mike Preston]] 23:01, 13 October 2010 (UTC))
(modified by --[[User:AbsMechanik|AbsMechanik]] 03:00, 14 October 2010 (UTC))

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system (Jensen: 1985).
There are several different algorithms which are utilized in different schedulers, but a few key algorithms are outlined below[http://joshaas.net/linux/linux_cpu_scheduler.pdf][http://www.sci.csueastbay.edu/~billard/cs4560/node6.html][http://www.articles.assyriancafe.com/documents/CPU_Scheduling.pdf]: <ul>

<li>First-Come, First-Serve (also known as FIFO): No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>

<li>Shortest Job First (similar to Shortest Remaining Time and/or Shortest Process Next): Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>

<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>

<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to First-Come, First-Serve, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>

<li>Multilevel Feedback Queue Scheduling: Rule-based multi-tasking. It is a combination of First-Come, First-Serve, Round-Robin & Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>

</ul>
There is no one "best" algorithm and most schedulers utilize a combination of the different algorithms, such as the Multi-Level Feedback Queue which in one way or another was utilized in Win XP/Vista, Linux 2.5-2.6, FreeBSD, Mac OSX, NetBSD and Solaris. One thing for certain is that as computer hardware increases in complexity, such as multiple core CPUs (parallelization), and with the advent of more powerful embedded/mobile devices, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers:
<ul>
<li>'''The default BSD/FreeBSD scheduler'''</li>
<li>'''The Linux scheduler'''</li>
</ul>

==BSD/Free BSD Schedulers==

===Overview & History===

(This work belongs to --[[User:Mike Preston|Mike Preston]] 23:41, 13 October 2010 (UTC))
(modified by --[[User:AbsMechanik|AbsMechanik]] 13:21, 14 October 2010 (UTC))

The FreeBSD kernel originally inherited its scheduler from 4.3BSD which itself is a version of the UNIX scheduler [http://dspace.hil.unb.ca:8080/bitstream/handle/1882/100/roberson.pdf?sequence=1].
In order to understand the evolution of the FreeBSD scheduler it is important to understand the original purpose and limitations of the BSD scheduler. Like most traditional UNIX based systems, the BSD scheduler was designed to work on a single core computer system (with limited I/O) and handle relatively small numbers of processes. As a result, managing resources with an O(n) scheduler did not raise any performance issues. To ensure fairness, the scheduler would switch between processes every 0.1 second (100 milliseconds) in a round-robin format [http://www.thehackademy.net/madchat/ebooks/sched/FreeBSD/the_FreeBSD_process_scheduler.pdf].

As computer systems increased in complexity with the advent of multi-core CPU's and various new I/O devices, computer programs, naturally, increased in size and complexity to accommodate and manage the new hardwares. With CPU's becoming more powerful (derived from Moore's Law [http://www.intel.com/technology/mooreslaw/]), the time taken to complete a process decreased significantly. The additional complexity highlighted the problem of having a O(n) scheduler for managing processes, as more items were added to the scheduling algorithm, the performance decreases. With symmetric multiprocessing (SMP) becoming inevitable (multi-core CPU's) a better scheduler was required. This was the driving force behind the creation of ULE for the FreeBSD.

===Older Versions===

(This work belongs to --[[User:Mike Preston|Mike Preston]] 00:02, 14 October 2010 (UTC))

The FreeBSD kernel originally used an enhanced version of the BSD scheduler. Specifically, the FreeBSD scheduler included classes of threads which was a drastic change from the round-robin scheduling used in BSD. Initially, there were two types of thread class, real-time and idle [https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf], and the scheduler would give processor time to real-time threads first and the idle threads had to wait until there were no real-time threads that needed access to the processor.

To manage the various threads, FreeBSD had data structures called runqueues into which the threads were placed. The scheduler would evaluate the runqueues based on priority from highest to lowest and execute the first thread of a non-empty runqueue it found. Once a non-empty runqueue was found, each thread in the runqueue would be assigned an equal value time slice of 0.1 seconds, a value that has not changed in over 20 years [http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156].

Unfortunately, like the BSD scheduler it was based on, the original FreeBSD scheduler was not built to handle Symmetric Multiprocessing (SMP) or Symmetric Multithreading (SMT) on multi-core systems. The sheduler was still limited by a O(n) algorithm which could not efficiently handle the loads required on ever increasingly powerful systems.
To allow FreeBSD to operate with more modern computer systems it became clear that a new scheduler would be required and, thus, became the driving force behind the implementation of ULE.

===Current Version===

(This work is owned by --[[User:Mike Preston|Mike Preston]] 00:23, 14 October 2010 (UTC))

In order to effectively manage multi-core computer systems, FreeBSD needed a scheduler with an algorithm which would execute in constant time regardless of the number of threads involved. The ULE scheduler was designed for this purpose. It is of interest to note that throughout the course of the BSD/FreeBSD scheduler evolution, each iteration has just been an improvement on existing scheduler technologies. Although each version was designed to provide support for some current reality of computing, like multi-core systems, the evolution was out of necessity and not due to a desire to re-evaluate how the current version accomplished its tasks.

The "slow" evolution of the FreeBSD scheduler becomes even more evident when comparing it to the Linux scheduler which has evolved through a series of attempts to provide alternative ways to solve scheduling tasks. From dynamic time slices, to various data structure implimentations, and even various ways of describing prioritiy levels (see: "nice" levels), the Linux scheduler advancement has occurred through a series of drastic changes. In comparison, the FreeBSD scheduler has been changed only when the current version was no longer able to meet the needs of the existing computing climate.

==Linux Schedulers==

(Note to the other group members: Feel free to modify or remove anything I post here. I'm just trying to piece together what you've all posted in the discussion section and turn it into a single paragraph. You know. Just to see how it looks.)

-- [[User:abondio2|Austin Bondio]] Last edit: 22:17, 13 October 2010 (UTC)

(Same for me, I'm trying to put together the overview/history and work on the comparison section of the essay, all based off the history you guys give. If I miss anything or get anything wrong, feel free to correct.)

-- [[User:Wlawrenc|Wesley Lawrence]]

(Austin - I added a reference to one of your sections as the current reference only went to wikipedia which the prof has kind of implied is not a good idea, I also added another one that was to a blog post as that was another thing the prof mentioned was not the best idea. I am hoping this will provide additional alidations to the sources.)
--[[User:Mike Preston|Mike Preston]] 00:29, 14 October 2010 (UTC)

===Overview & History===

(This work belongs to [[User:Wlawrenc|Wesley Lawrence]])

The Linux scheduler has a large history of improvement, always aiming towards having a fair and fast scheduler. Various methods and concepts have been tried over different versions to get this fair and fast scheduler, including round robins, iteration, and queues. A quick read through of the history of Linux implies that firstly, equal and balanced use of the system was the goal of the scheduler, and once that was in place, speed was soon improved. Early schedulers did their best to give processes equal time and resources, but used a bit of extra time (in computer terms) to accomplish this. By Linux 2.6, after experimenting with different concepts, the scheduler was able to provide fair access and time, as well as run as quickly as possible, with various features to allow personal tweaking by the system user, or even the processes themselves.

(This work was done by [[User:abondio2|Austin Bondio]], modified by [[User:Sschnei1|Sschnei1]] )

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969 [http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html](Stallings: 2009). The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

===Older Versions===

(This work belongs to [[User:Sschnei1|Sschnei1]])

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a scheduling event. The scheduler divided tasks into epochs, allowing each task to execute up to its time slice. If a task did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware architectures, such as multi-core processors.

===Current Version===

(This work was done by [[User:Sschnei1|Sschnei1]])

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor. When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently. Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side of the tree are migrated to the left side to maintain fairness.

(This work was done by [[User:abondio2|Austin Bondio]])

Under a recent Linux system (version 2.6.35 or later), scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.

==Tabulated Results==

(Once I read/see some history on the BSD section above, I'll do the best comparison I can. I'm balancing 3000/3004 and other courses (like most of you), so I don't think I can research/write BSD and write the comparison, but I will try to help out as much as I can)

-- [[User:Wlawrenc|Wesley Lawrence]]

I've got this. Hopefully most of the sections I created properly answer the question. I'm still going to go over everyone's answers and keep in mind that wikipedia cannot be cited as a resource. --[[User:AbsMechanik|AbsMechanik]] 02:29, 14 October 2010 (UTC)

==Current Challenges==

== Resources ==

1. Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

2. Stallings, William, Operating Systems: Internals and Design Principles, Pearson Prentice Hall, 2009.

3. McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T23:27:34Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need timeslicing and still provide most performance with as much cpu utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

The ULE has a constant execution time of O(1), regardless of the number of threads. In addition, it also is careful to identify interactive tasks and give them the lowest latency response possible. The scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a priority calculator, and a slice calculator.

Since ULE is an event driven scheduler, there is no periodic timer that adjusts thread priority to prevent starvation. Fairness is implemented by maintaining two queues: the current and the next queue. Each thread that is granted a CPU slice is assigned to either the current or next queue. Threads are then picked from the current queue in order of priority until it is empty, which is when the next and current queues are switched, and the process begins again. This guarantees that each thread will be given use of its slice once every two queue cycles, regardless of priority. Interrupt and real-time threads (and threads with these priority levels) are inserted onto the current queue, for they are of the highest priority. There is also an idle class of threads, which is checked only when there are no other runnable tasks.

In order to promptly discover when a thread changes from interactive to non-interactive, the ULE scheduler uses Interactivity Scoring, a key part of the scheduler affecting the responsiveness of the system, and subsequently, user experience. An interactivity score is computed from the relationship between run and sleep time, using a formula that is out of scope for this project. Interactive threads usually have high sleep times because they are often waiting for user input. This usually is followed by quick bursts in CPU activity from processing the user's request.

The ULE also implements a priority calculator, not to maintain fairness, but only to order the threads by priority. Only time-sharing threads use the priority calculator, the rest are allotted statically. ULE uses the interactivity score to determine the nice value of a thread, which will allow it to in turn decide upon its priority.

The way that the ULE uses its nice values in combination with slices is by implementing a moving window of nice values allowed slices. The threads within the window are given a slice oppositely proportional to the difference between their nice value and the lowest recorded nice value. This results in smaller slices to nicer threads, which subsequently defines their amount of allotted CPU time. On x86, FreeBSD allows for a minimum slice value of 10ms, and a maximum of 140ms. Interactive tasks receive the smallest nice value, in order to more promptly find that the interactive task is no longer interacting.

The ULE uses a CPU usage estimator to show roughly how much CPU a thread is using. It operates on an event driven basis. ULE keeps track of the number of clock ticks that occurred within a sliding window of the thread's execution time. The window slides to upwards of one second past threshold, and then back down to regulate the ratio of run time to sleep time.

The ULE enables SMP (Symmetric Multiprocessing) in order to greater achieve CPU affinity, which is when you schedule threads onto the last CPU they were run on, as to avoid unnecessary CPU migrations. Processors typically have large caches to aid the performance of threads and processes. CPU affinity is key because the thread may still have leftover data in the caches of the previous CPU, and when a thread migrates, it not only has to load this data into the new CPU's cache, but must also clear the data from the previous CPU's cache. ULE has two methods for CPU load balancing: pull and push. Pull is when an idle CPU grabs a thread from a non-idle CPU to lend a hand. Push is when a periodic task evaluates the current load situation of the CPUs and balances it out amongst them. These two function side by side, and allow for an optimally balanced CPU workload.

SMT (Symmetric Multithreading), a concept of non-uniform processors, is not fully present in ULE. The foundations are there, however, which can eventually be extended to support NUMA (Non-Uniform Memory Architecture). This involves expressing the penalties of CPU migration through separate queues, which could be extended to add a local and global load-balancing policy. As far as my sources go, FreeBSD does not at this point support NUMA, however the groundwork is there, and it is a real possibility for it to appear in a future version.

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 23:27, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T23:00:38Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need timeslicing and still provide most performance with as much cpu utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed above, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T22:59:49Z

CFaibish: /* Essay Preview */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need timeslicing and still provide most performance with as much cpu utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed below, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc

I'll be posting more of what I've got on the BSD stuff under the hour.

--[[User:CFaibish|CFaibish]] 22:59, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T22:58:59Z

CFaibish: /* Discussion */

=Resources=

I just moved the Resources section to our discussion page --[[User:AbsMechanik|AbsMechanik]] 18:19, 13 October 2010 (UTC)

I found some resources, which might be useful to answer this question. As far as I know, FreeBSD uses a Multilevel feeback queue and Linux uses in the current version the completly fair scheduler.
 
-Some text about FreeBSD-scheduling http://www.informit.com/articles/article.aspx?p=366888&seqNum=4
 
-ULE Thread Scheduler: http://www.scribd.com/doc/3299978/ULE-Thread-Scheduler-for-FreeBSD
 
-Completly Fair Scheduler: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
 
-Brain Fuck Scheduler: http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
 
-Sebastian

Also found a nice link with regards to the new Linux Scheduler for those interested:
http://www.ibm.com/developerworks/linux/library/l-scheduler/
 It is also referred to as the O(1) scheduler in algorithmic terms (CFS is O(log(n)) scheduler). Both have been in development by Ingo Molnár.
-Abhinav

Some more resources; 
http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html (includes history of Linux scheduler from 1.2 to 2.6) 
http://my.opera.com/blu3c4t/blog/show.dml/1531517 
-Wes 

 
Information on changes to the O(1) scheduler: 
"Linux Kernel Documentation" 
http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt 
 
General information on Linux Job Scheduling: 
"Linux Job Scheduling | Linux Journal" 
http://www.linuxjournal.com/article/4087 
 
Scheduling on multi-core Linux machines: 
"Node affine NUMA scheduler for Linux" 
http://home.arcor.de/efocht/sched/ 
 
More on Linux process scheduling: 
"Understanding the Linux kernel" 
http://oreilly.com/catalog/linuxkernel/chapter/ch10.html 
 
FreeBSD thread scheduling: 
"InformIT: FreeBSD Process Management" 
http://www.informit.com/articles/article.aspx?p=366888&seqNum=4 
- Austin Bondio

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

Here's something I put into the Linux: Overview section:

The Linux kernel has undergone many changes over the decades since its original release as the UNIX operating system in 1969.[http://www.unix.com/whats-your-mind/110099-unix-40th-birthday.html] The early versions had relatively inefficient schedulers which operated in linear time with respect to the number of tasks to schedule; currently the Linux scheduler is able to operate in constant time, independent of the number of tasks being scheduled.

There are five basic algorithms for allocating CPU time[http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines]: <ul>
<li>First-in, First-out: No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.</li>
<li>Shortest Time Remaining: Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.</li>
<li>Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.</li>
<li>Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to Fixed-Priority Preemptive Scheduling, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.</li>
<li>Multilevel Queue Scheduling: Rule-based multi-taksing. This method is also similar to Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.</li>
</ul>

-- [[User:abondio2|Austin Bondio]] Last edit: 22:27, 13 October 2010 (UTC)

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, such in the O(1) scheduler, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and put in a virtual runtime. If the process is ready to run, it is given CPU time to run. The tree re-balances itself and new tasks can be taken out by the CPU.

CFS is designed in a way that it does not need timeslicing and still provide most performance with as much cpu utilization. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

I would agree with putting superscripted citations that refer to the Sources section? How do they do it in the wikipedia?
-- [[User:Sschnei1|Sschnei1]] 18:52, 13 October 2010 (UTC)

Superscripted citations seems to be the best way to do it. If we cite URLs throughout the essay, it will be much harder to read. To put in a superscripted citation, enclose the URL of your source in square brackets.

Also, who here is actually good at writing, and can compile all these paragraphs into one nice essay for us? I think we have enough raw information here, it's just a matter of putting it all together now.

-- [[abondio2|Austin Bondio]] 20:39, 13 October 2010 (UTC)

Abhinav is putting something together right now on the main page.

-- [[User:Sschnei1|Sschnei1]] 20:56, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:57:50Z

CFaibish: /* Discussion */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed below, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

Also: as Mike said earlier about BSD's issue with locking mechanisms, should I go into greater detail about that, or just include a little, few sentence description of the issue? I've found a source for what I think is what he was referring to: http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
--[[User:CFaibish|CFaibish]] 17:54, 13 October 2010 (UTC)

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:55:46Z

CFaibish: /* Essay Preview */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed below, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

--[[User:CFaibish|CFaibish]] 17:54, 13 October 2010 (UTC)

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

1 = http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node46.html
2 = http://security.freebsd.org/advisories/FreeBSD-EN-10:02.sched_ule.asc
3 = McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:54:28Z

CFaibish: /* Discussion */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed below, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

--[[User:CFaibish|CFaibish]] 17:54, 13 October 2010 (UTC)

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:54:18Z

CFaibish: /* Discussion */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

Hi, here's a little forward on schedulers in relation to types of threads I've composed based off of one of my sources, I'm not sure if its necessary since there is one Mike typed below, but here it just for you guys to examine:

Threads that perform a lot of I/O require a fast response time to keep input and output devices busy, but need little CPU time. On the other hand, compute-bound threads need to receive a lot of CPU time to finish their work, but have no requirement for fast response time. Other threads lie somewhere in between, with periods of I/O punctuated by periods of computation, and thus have requirements that vary over time. A well-designed scheduler should be able accommodate threads with all these requirements simultaneously.

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:51:43Z

CFaibish: /* Essay Preview */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

--[[User:CFaibish|CFaibish]] 17:51, 13 October 2010 (UTC)

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt

Talk:COMP 3000 Essay 1 2010 Question 5

2010-10-13T17:51:32Z

CFaibish: /* Essay Preview */

=Discussion=

From what I have been reading the early versions of the Linux scheduler had a very hard time managing high numbers of tasks at the same time. Although I do not how it ran, the scheduler algorithm operated at O(n) time. As a result as more tasks were added, the scheduler would become slower. In addition to this, a single data structure was used to manage all processors of a system which created a problem with managing cached memory between processors. The Linux 2.6 scheduler was built to resolve the task management issues in O(1), constant, time as well as addressing the multiprocessing issues.

It appears as though BSD also had issues with task management however for BSD this was due to a locking mechanism that only allowed one process at a time to operate in kernel mode. FreeBSD 5 changed this locking mechanism to allow multiple processes the ability to run in kernel mode at the same time advancing the success of symmetric multiprocessing.

--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

Hi Mike,
Can you give any names for the schedulers you are talking about? I think it is easier to distinguish by names and not by the algorithm. It is just a suggestion!

The O(1) scheduler was replaced in the linux kernel 2.6.23 with the CFS (completly fair scheduler) which runs in O(log n). Also, the schedulers before CFS were based on a Multilevel feedback queue algorithm, which was changed in 2.6.23. It is not based on a queue as most schedulers, but on a red-black-tree to implement a timeline to make future predictions. The aim of CFS is to maximize CPU utilization and maximizing the performance at the same time.

In FreeBSD 5, the ULE Scheduler was introduced but disabled by default in the early versions, which eventually changed later on. ULE has better support for SMP and SMT, thus allowing it to improve overall performance in uniprocessors and multiprocessors. And it has a constant execution time, regardless of the amount of threads.

More information can be found here:
 
http://lwn.net/Articles/230574/
 
http://lwn.net/Articles/240474/

[[User:Sschnei1|Sschnei1]] 16:33, 3 October 2010 (UTC) or Sebastian

Here is another article which essentially backs up what you are saying Sebastian: http://delivery.acm.org/10.1145/1040000/1035622/p58-mckusick.pdf?key1=1035622&key2=8828216821&coll=GUIDE&dl=GUIDE&CFID=104236685&CFTOKEN=84340156

Here are the highlights from the article:

General FreeBSD knowledge:
1. requires a scheduler to be selected at the time the kernel is built.
2. all calls to scheduling code are resolved at compile time...this means that the overhead of indirect function calls for scheduling decisions is eliminated.
3. kernels up to FreeBSD 5.1 used this scheduler, but from 5.2 onward the ULE scheduler used.

Original FreeBSD Scheduler:
1. threads assigned a scheduling priority which determines which 'run queue' the thread is placed in.
2. the system scans the run queues in order of highest priority to lowest priority and executes the first thread of the first non-empty run queue it finds.
3. once a non-empty queue is found the system spends an equal time slice on each thread in the run queue. This time slice is 0.1 seconds and this value has not changed in over 20 years. A shorter time slice would cause overhead due to switching between threads too often thus reducing productivity.
4. the article then provides detailed formulae on how to determine thread priority which is out of our scope for this project.

ULE Scheduler
- overhaul of Original BSD scheduler to:
1. support symmetric multiprocessing (SMP)
2. support symmetric multithreading (SMT) on multi-core systems
3. improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.

Here is another article which gives some great overview of a bunch of versions/the evolution of different schedulers: https://www.usenix.org/events/bsdcon03/tech/full_papers/roberson/roberson.pdf
 
Some interesting pieces about the Linux scheduler include:
1. The Jan 2002 version included O(1) algorithm as well as additions for SMP.
2. Scheduler uses 2 priority queue arrays to achieve fairness. Does this by giving each thread a time slice and a priority and executes each thread in order of highest priority to lowest. Threads that exhaust their time slice are moved to the exhausted queue and threads with remaining time slices are kept in the active queue.
3. Time slices are DYNAMIC, larger time slices are given to higher priority tasks, smaller slices to lower priority tasks.
 
I thought the dynamic time slice piece was of particular interest as you would think this would lead to starvation situations if the priority was high enough on one or multiple threads.
--[[User:Mike Preston|Mike Preston]] 18:38, 3 October 2010 (UTC)

This is essentially a summarized version of the aforementioned information regarding CFS (http://www.ibm.com/developerworks/linux/library/l-scheduler/).
--[[User:AbsMechanik|AbsMechanik]] 02:32, 4 October 2010 (UTC)

I have seen this website and thought it is useful. Do you think this is enough on research to write an essay or are we going to do some more research?
--[[User:Sschnei1|Sschnei1]] 09:38, 5 October 2010 (UTC)

I also stumbled upon this website: http://my.opera.com/blu3c4t/blog/show.dml/1531517. It explains a lot of stuff in layman's terms (I had a lot of trouble finding more info on the default BSD scheduler, but this link has some brief description included in it). I think we have enough resources/research done. We should start to formulate these results into an answer now. --[[User:AbsMechanik|AbsMechanik]] 20:08, 4 October 2010 (UTC)

So I thought I would take a first crack at an intro for our article, please tell me what you think of the following. Note that I have included the resource used as a footnote, the placement of which I indicate with the number 1, and I just tacked the details of the footnote on at the bottom:

See Essay preview section!

--[[User:Mike Preston|Mike Preston]] 02:54, 6 October 2010 (UTC)

I added a part to introduce the several schedulers for LINUX. We might need to change the reference, since I got it all from http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

-- [[User:Sschnei1|Sschnei1]] 19:27, 9 October 2010 (UTC)

Maybe we should write down our contact emails and names to write down who would like to write what part.

Another suggestion is that someone should read over the text and compare it to the references posted in the "Sources" section and check if someone is doing plagiarism.

Sebastian Schneider - sebastian@gamersblog.ca

= Essay Preview =

So just a small, quick question. Are we going to follow a certain standard for citing resources (bibliography & footnotes) to maintain consistency, or do we just stick with what Mike's presented?--[[User:AbsMechanik|AbsMechanik]] 12:53, 7 October 2010 (UTC)

Maybe we should write the essay templates/prototypes here, to keep overview of the discussion part.

Just relocating previous post with suggested intro paragraph:

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system.1 As computer hardware has increased in complexity, for example multiple core CPUs, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers; the default BSD/FreeBSD and Linux schedulers.

1 Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

--[[User:Mike Preston|Mike Preston]] 03:48, 7 October 2010 (UTC)

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be
efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed. It now used the idea
of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was
the first scheduler which supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its
predecessors, but it also has more features. The running time was O(n) because it iterated over each task during a
scheduling event. The scheduler divided tasks into epochs, allowing each tasks to execute up to its time slice. If a task
did not use up all of its time slice, the remaining time was added to the next time slice to allow the task to execute
longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability and
did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware
architectures, such as multi-core processors.

Linux-2.6 introduced another scheduler up to Linux 2.6.23. Before Linux 2.6.23 an O(1) scheduler was used. It needed the
same amount of time for each task to execute, independent of how big the tasks were.It kept track of the tasks in a
running queue. The scheduler offered much more scalability. To determine if a task was I/O bound or processor bound the
scheduler used interactive metrics with numerous heuristics. Because the code was difficult to manage and the most part of
the code was to calculate heuristics, it was replaced in Linux 2.6.23 with the CFS scheduler, which is the current
scheduler in the actual Linux versions.

As of the Linux 2.6.23 introduction the CFS scheduler took its place in the kernel. CFS uses the idea of maintaining
fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor.
When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep
fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual
runtime.

The model how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing
and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently.
Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu
are stored in the right side of the tree. To keep fairness the scheduler takes the left most node from the tree. The
scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable the task then is inserted
into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side
of the tree are migrated to the left side to maintain fairness. [http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html]

-- [[User:Sschnei1|Sschnei1]] 19:26, 9 October 2010 (UTC)

I've started writing a bit about the Linux O(1) scheduler:

Under a Linux system, scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.[http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt]

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.[http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726]

-- [[User:abondio2|Austin Bondio]] Last edit: 14:39, 12 October 2010

I'm writing on a contrast of the CFS scheduler right now, please don't edit it.

In contrast the the O(1) scheduler, CFS realizes the model of a scheduler which can execute precise on real multitasking on real hardware. Precise multitasking means that each process can run at equal speed. If 4 processes are running at the same time, CFS assigns 25% of the CPU time to each process. On real hardware, only one task can be executed at a time and other tasks have to wait, which gives the running tasks an unfair amount of CPU time.

To avoid an unfair balance over the processes, CFS has a wait run-time for each process. CFS tries to pick the process with the highest wait run-time value. To provide a real multitasking, CFS splits up the CPU time between running processes.

Processes are not stored in a run queue, but in a self-balancing red-black tree, where self-balancing means that the task with the highest need for CPU time is stored in the most left node. Tasks with a lower need for CPU time are stored on the right side of the Tree, where tasks with a higher need for CPU time are stored on the left side. The task on the left side is picked by the scheduler and given CPU time to run. The tree re-balances itself and new tasks can be inserted.

CFS is designed in a way that it does not need timeslicing. This is due to the nanosecond granularity, which removes the need for jiffies or other HZ details. [http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt]

-- [[User:Sschnei1|Sschnei1]] 16:32, 13 October 2010 (UTC)

Hey guys, sorry I've been non-existent for the past little bit, here's what I've done so far. I've been going through stuff on the 4BSD and ULE schedulers, here's what I have so far:

In order for FreeBSD to function, it requires a scheduler to be selected at the time the kernel is built. Also, all calls to scheduling code are resolved at compile time, meaning that the overhead of indirect function calls for scheduling decisions is eliminated.

[3] The 4BSD scheduler was a general-purpose scheduler. Its primary goal was to balance threads’ different scheduling requirements. FreeBSD's time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements and the amount consumed by the thread. Based on the thread's priority, it gets moved between run queues. When a new thread attains a higher priority than the currently running one, the system immediately switches to the new thread, if it's in user mode. Otherwise, the system switches as soon as the current thread leaves the kernel. The system scans the run queues in order of highest to lowest priority, and executes the first thread of the first non-empty run queue it finds. The system tailors it's short-term scheduling algorithm to favor user-interactive jobs by raising the priority of threads waiting for I/O for one or more seconds, and by lowering the priority of threads that hog up significant amounts of CPU time.

[1] In older BSD systems, (and I mean old, as in 20 or so years ago), a 1 second quantum was used for the round-robin scheduling algorithm. Later, in BSD 4.2, it did rescheduling every 0.1 seconds, and priority re-computation every second, and these values haven’t changed since. Round-robin scheduling is done by a timeout mechanism, which informs the clock interrupt driver to call a certain system routine after a specified interval. The subroutine to be called, in this case, causes the rescheduling and then resubmits a timeout to call itself again 0.1 sec later. The priority re-computation is also timed by a subroutine that resubmits a timeout for itself.

The ULE Scheduler was first introduced in FreeBSD 5, however disabled by default in favor of the default 4BSD scheduler. It was not until FreeBSD 7.1 that the ULE scheduler became the new default. The ULE scheduler was an overhaul of the original scheduler, and allowed it support for symmetric multiprocessing (SMP), support for symmetric multithreading (SMT) on multi-core systems, and improve the scheduler algorithm to ensure execution is no longer limited by the number of threads in the system.
<more to come>

Notes: Lots of this is just paraphrasing stuff you guys said in the discussion section. In terms of citations, should it be a superscripted citation next to the fact snippet we used, or should it just be a list of sources at the bottom?

= Sources =

[1] http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html

[2] http://www.mjmwired.net/kernel/Documentation/scheduler/sched-nice-design.txt

[3] http://oreilly.com/catalog/linuxkernel/chapter/ch10.html#94726

[4] http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt