Talk:COMP 3000 Essay 2 2010 Question 3


Group 3 Essay

Hello everyone, please post your contact information here:

Ben Robson brobson@connect.carleton.ca

Rey Arteaga: rarteaga@connect.carleton.ca

Corey Faibish

Tawfic Abdul-Fatah: tfatah@gmail.com

Fangchen Sun: sfangche@connect.carleton.ca

Mike Preston: michaelapreston@gmail.com

Wesley L. Lawrence: wlawrenc@connect.carleton.ca


Can't access the video without a login, as we found out in class, but you can listen to the speech and follow with the slides pretty easily. I just went through it and it's not too bad. Rarteaga

Question 3 Group

  • Abdul-Fatah Tawfic tafatah
  • Arteaga Reynaldo rarteaga
  • Faibish Corey cfaibish
  • Lawrence Wesley wlawrenc
  • Preston Mike mpreston
  • Robson Benjamin brobson
  • Sun Fangchen sfangche

Who is working on what?

Just to keep track of who's doing what --Tafatah 01:37, 15 November 2010 (UTC)

Hey everyone, I have taken the liberty of providing a good start on our paper. I have provided many resources and filled in information for all of the sections. This is not complete, but it should make the rest of the work a lot easier. Please go through and add in pieces that I am missing (specifically in the Critique section) and then we can put this essay to bed. Also, please note that below I have included my notes on the paper so that if anyone feels they do not have time to read the paper, they can read my notes instead and still find additional materials to contribute. --Mike Preston 18:22, 20 November 2010 (UTC)

Man, Mike: you did a nice job! I'm reading through it now; it's very thorough :) Since you pretty much turned all of your bulleted points from the discussion page into content on the main page, what else needs to be done? Just expanding on each topic and sub-topic? Or are there untouched concepts/topics that we should be addressing? Oh, and question two: should we turn the Q&A from the end of the video of the presentation into information for the Critique section? --CFaibish 20:34, 22 November 2010 (UTC)

Mike, thnx for the great job! I basically finished the part of related work based on your draft. --Fangchen Sun 17:40, 24 November 2010 (UTC)

No problem, and great additions. In terms of what needs to be done, I do believe that adding some detail to the critique is where we really need some focus. Using the Q&A from the video is probably a great source of inspiration; maybe just take a look at the topics presented, see if additional material from other sources can be obtained, and use those sources to address any pros or cons of this article. Remember, the critique section can agree or disagree with what is presented in the actual paper. --Mike Preston 15:12, 28 November 2010 (UTC)

I noticed we needed some work in the Critique section, so I listened to the Q&A session at the end of the FlexSC mp3 talk and took some quick notes. There seem to be 3 good ones (of the 9) that I picked out. I'll summarize them and add them to the Critique section, specifically questions 3, 6, and 7. If anyone else wants to have a listen to a specific question, and maybe try to do some more 'critiquing', here is a list of when each question takes place, a very general statement of what each question is about, and the very general answer:

1 - 22:30
Q: Did the paper consider Upstream patches(?)
A: No

2 - 23:00
Q: Security issues with the pages
A: Pages pre-processor, no issue

3 - 24:10
Q: What about blocking calls (read/write)?
A: Not handled

4 - 25:50
Q: ?
A: Not a problem? (Personally didn't understand the question; don't believe it's important, but anyone who's willing should double check)

5 - 28:00
Q: Compare pollution between user thread switching to user-kernel thread switching?
A: No, only looked at and measured pollution when switching user-to-kernel.

6 - 29:30
Q: Scheduling problems of what cores are 'system' core, and what cores are 'user' cores
A: Very simple algorithm, but not tested when running multiple apps, would need to be "fine-tuned"

7 - 31:00
Q: Situations where FlexSC is bad, e.g. when running no more threads than cores, such as "Scientific programs" that run mostly in userspace where one thread wants 100% of a CPU
A: Agrees, FlexSC is not to be used for such situations

8 - 33:00
Q: Problems with unrelated threads demanding service, how does it scale? Issue with the frequency of polling could cause sys calls to take time to perform
A: (Would be answered offline)

9 - 34:30
Q: Backwards compatibility and robustness
A: Only an issue with getTID (Thread ID), needed a small patch.

--Wesley Lawrence 20:31, 29 November 2010 (UTC)


Wrote information in Critique for questions 3, 6 and 7 (Blocking Calls, Core Scheduling Issues, and When There Are Not More Threads Than Cores). If you feel any additions need to be made, please feel free to add them. Most importantly, I'm not sure how to cite these. All information was obtained from the mp3 of the presentation; could someone let me know how to go about citing this?

--Wesley Lawrence 21:05, 29 November 2010 (UTC)

I'm going to run through the whole paper, make sure everything makes sense, and fill in the holes where needed. I'll also add my own thoughts along the way. Feel free to do the same. -Rarteaga

Added 3 sections to the critique, definitions for the remaining terms (thanks Corey for taking care of some of these) and did some editing. My plan is to add some more flesh to the FlexSC-Threads section. I'll do that sometime before 3PM on Thursday. I'll also go over the paper at that time in case something needs some editing. --Tafatah 06:38, 2 December 2010 (UTC)

I'm going to be working on the contributions section under Implementation and demonstrating some statistics they showed in the paper. Rarteaga

Paper Summary

I am not sure if everyone has taken the time to examine the paper closely, so I thought I would provide my notes on the paper so that anyone who has not read it could have a view of the high points.

Abstract:

  - System calls have historically been the accepted way to request services from the OS kernel.
  - System calls are almost always synchronous.
  - Aim to demonstrate how synchronous system calls negatively affect performance due mainly to pipeline flushing and pollution of key processor structures (TLB, data and instruction caches, etc.)
       o The TLB (translation lookaside buffer) caches virtual-to-physical page translations (for data and code pages) to speed up address translation.
  - Propose exception-less system calls to improve the current system call process.
       o Improve processor efficiency by enabling flexible scheduling of OS work, which in turn increases locality of execution in both kernel and user space, thus reducing pollution effects on processor structures.
  - Exception-less system calls especially effective on multi-core systems running multi-threaded applications.
  - FlexSC is an implementation of exception-less system calls in the Linux kernel with accompanying user-mode threads from FlexSC-Threads package.
       o FlexSC-Threads converts legacy synchronous system calls into exception-less system calls.

Introduction:

  - Synchronous system calls have a negative impact on system performance due to:
       o Direct costs – mode switching
       o Indirect costs – pollution of important processor structures 
  - Traditional system calls:
       o Involve writing arguments to the appropriate registers and issuing a special machine instruction, which raises a synchronous exception (a minimal sketch of this appears after this list).
       o A processor exception is used to communicate with the kernel.
       o Synchronous execution is enforced as the application expects the completion of the system call before user-mode execution resumes.
  - Moore’s Law has provided large increases to performance potential of software while at the same time widening the gap between the performance of efficient and inefficient software.
       o This gap is mainly caused by disparity of accessing different processor resources (registers, caches, memory)
  - Server and system-intensive workloads are known to perform well below processor potential throughput.
       o These are the items the researchers are mostly interested in.
       o The cause is often described as due to the lack of locality.
       o The researchers state this lack of locality is in part a result of the current synchronous system calls.
  - When a synchronous system call like pwrite is issued, the instructions-per-cycle (IPC) rate drops significantly, and it takes many cycles of execution (14,000 in the paper's example) for the IPC rate to return to its pre-call level.
  - Exception-less System Call:
       o Request for kernel services that does not require the use of synchronous processor exceptions.
       o System calls are written to a reserved syscall page.
       o Execution of system calls is performed asynchronously by special kernel level syscall threads. The result of the execution is stored on the syscall page after execution.
  - By separating system call execution from system call invocation, the system can now have flexible system call scheduling.
       o This allows system calls to be executed in batches, increasing the temporal locality of execution.
       o Also provides a way to execute system calls on a separate core, in parallel to user-mode thread execution. This provides spatial per-core locality.
       o An additional side effect is that now a multi-core system can have individual cores designated to run either user-mode or kernel mode execution dynamically depending on the current system load.
  - In order to implement the exception-less system calls, the research team suggests adding a new M-on-N threading package.
       o M user-mode threads executing on N kernel-visible threads.
       o This allows the threading package to harvest independent system calls by switching threads in user mode whenever a thread invokes a system call.
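
To make the "special machine instruction" bullet above concrete, here is a minimal sketch of how a traditional synchronous system call is issued on x86-64 Linux. The raw_write name is ours and this is only an illustration; in practice the libc wrapper emits the same syscall instruction on the application's behalf.

<pre>
/* A synchronous write(2) issued directly with the x86-64 Linux syscall
 * ABI: the syscall number goes in rax, arguments in rdi/rsi/rdx, and
 * the `syscall` instruction traps into the kernel.  The thread cannot
 * continue until the kernel returns. */
static long raw_write(int fd, const void *buf, unsigned long count) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                               /* return value in rax  */
        : "a"(1), "D"(fd), "S"(buf), "d"(count)   /* 1 = __NR_write       */
        : "rcx", "r11", "memory"                  /* clobbered by syscall */
    );
    return ret;
}

int main(void) {
    raw_write(1, "hello via a synchronous syscall\n", 32);
    return 0;
}
</pre>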

The (Real) Cost of System Calls

  - The traditional way to measure the performance cost of system calls is the mode switch time: the time necessary to execute the system call instruction in user mode, resume execution in kernel mode, and then return execution back to user mode (a rough user-space measurement of this round trip is sketched after this list).
  - Mode switch in modern processors is a processor exception.
       o Flush the user-mode pipeline, save registers onto the kernel stack, change the protection domain and redirect execution to the proper exception handler.
  - Another measure of the performance of a system call is the state pollution caused by the system call.
       o State pollution is a measure of how much user-mode state in structures like the TLB, caches (L1, L2, L3), and branch prediction tables is overwritten by the kernel-level execution of the system call.
       o This data must be re-populated upon the return to user-mode.
  - Potentially the most significant measure of cost of system calls is the performance impact on a running application.
       o Ideally, user-mode instructions per cycle should not decrease as a result of a system call.
       o Synchronous system calls do cause a drop in user-mode IPC, due to: direct overhead (the processor exception associated with the system call, which flushes the processor pipeline) and indirect overhead (system call pollution of processor structures).
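
As a rough point of reference for the direct cost mentioned above, a short user-space timing loop like the one below can estimate the round-trip time of a cheap system call. This is our own sketch, not the paper's methodology (the authors use hardware performance counters), and it cannot see the indirect cost: the cache, TLB, and branch-predictor pollution.

<pre>
/* Rough estimate of the direct (round-trip) cost of a cheap system call
 * on Linux.  Indirect costs -- cache/TLB/branch-predictor pollution --
 * are invisible to this kind of measurement. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);            /* a near-trivial system call */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("~%.0f ns per getppid() round trip\n", ns / iters);
    return 0;
}
</pre>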

Exception-less System calls:

  - System call batching
       o By delaying a series of system calls and executing them in batches you can minimize the frequency of mode switches between user and kernel mode.
       o Improves both the direct and indirect cost of system calls.
  - Core specialization
       o A system call can be scheduled on a different core than the core on which it was invoked; this is only possible with exception-less system calls.
       o Provides ability to designate a core to run all system calls.
  - Exception-less Syscall Interface
       o A set of memory pages shared between user and kernel mode, referred to as syscall pages (a sketch of the entry format and submit/poll protocol follows this list).
       o User-space threads find a free entry in a syscall page and place a request for a system call. The user-space thread can then continue executing without interruption and must then return to the syscall
page to get the return value from the system call.
       o Neither issuing the system call (via the syscall page) nor getting the return value generate an exception.
  - Syscall pages
       o Each page is a table of syscall entries.
       o Each syscall entry has a state:
                 - Free – means a syscall request can be added here.
                 - Submitted – means the kernel can proceed to invoke the appropriate system call operations.
                 - Done – means the kernel is finished and has placed the return value in the syscall entry. The user-space thread must return to the page to collect it.
  - Decoupling Execution from Invocation
       o To separate these two concepts a special kernel thread, syscall thread, is used.
       o Sole purpose is to pull requests from syscall pages and execute them always in kernel mode.
       o Syscall threads provide the ability to schedule the system calls on specific cores.
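
To make the syscall-page idea above concrete, here is a small user-space simulation of the Free/Submitted/Done protocol. The entry layout, the field names, and the pthread standing in for the kernel-side syscall thread are all our own invention for illustration; in real FlexSC the table lives in pages shared with the kernel and the entries are executed by in-kernel syscall threads, so this is a sketch of the protocol, not the actual ABI.

<pre>
/* User-space simulation of a FlexSC-style syscall page.  A worker
 * pthread plays the role of the kernel syscall thread: it scans the
 * shared table for SUBMITTED entries, executes them with syscall(2),
 * stores the return value, and marks them DONE.  Names and layout are
 * illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

enum entry_status { FREE, SUBMITTED, DONE };

struct syscall_entry {
    _Atomic int status;      /* FREE -> SUBMITTED -> DONE -> FREE */
    long        number;      /* syscall number, e.g. SYS_write    */
    long        args[6];
    long        ret;
};

#define ENTRIES 8
static struct syscall_entry page[ENTRIES];   /* the "syscall page" */

static void *syscall_thread(void *unused) {
    (void)unused;
    for (;;) {
        for (int i = 0; i < ENTRIES; i++) {
            if (atomic_load(&page[i].status) == SUBMITTED) {
                page[i].ret = syscall(page[i].number,
                                      page[i].args[0], page[i].args[1],
                                      page[i].args[2], page[i].args[3],
                                      page[i].args[4], page[i].args[5]);
                atomic_store(&page[i].status, DONE);
            }
        }
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, syscall_thread, NULL);

    /* Submit an exception-less write(1, msg, len): fill a free entry and
     * flip it to SUBMITTED -- no trap into the kernel on this path. */
    static const char msg[] = "posted without a synchronous exception\n";
    struct syscall_entry *e = &page[0];
    e->number  = SYS_write;
    e->args[0] = 1;
    e->args[1] = (long)(uintptr_t)msg;
    e->args[2] = sizeof msg - 1;
    atomic_store(&e->status, SUBMITTED);

    /* The caller is free to do other work; here we simply poll.
     * FlexSC-Threads would switch to another ready user thread instead. */
    while (atomic_load(&e->status) != DONE)
        ;
    printf("write returned %ld\n", e->ret);
    atomic_store(&e->status, FREE);
    return 0;
}
</pre>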

System Calls Galore – FlexSC-Threads

  - Programming for exception-less system calls requires a different and more complex way of interacting with the kernel for OS functionality.
       o The researchers describe working with exception-less system calls as being similar to event-driven programming in that you do not get the same sequential execution of code as you do with synchronous
system calls.
       o For event-driven servers, the researchers suggest a hybrid: exception-less system calls on performance-critical paths and regular synchronous system calls for less critical calls.

FlexSC-Threads

  - Threading package which transforms synchronous system calls into exception-less system calls.
  - Intended use is with server-type applications which have many user-mode threads (like Apache or MySQL).
  - Compatible with both POSIX threads and the default Linux thread library.
       o As a result, multi-threaded Linux programs are immediately compatible with FlexSC threads without modification.
  - For multi-core systems, a single kernel-level thread is created for each core of the system. Multiple user-mode threads are multiplexed onto each kernel-level thread via interactions with the syscall pages (a toy illustration of this multiplexing follows this list).
       o The syscall pages are private to each kernel-level thread, which means each core of the system has a syscall page from which it receives system calls.
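
The M-on-N multiplexing described above can be pictured with a toy example. This is not FlexSC-Threads code: it just shows two user-mode threads sharing one kernel-visible thread via ucontext(3). FlexSC-Threads performs a switch of this kind at the point where a user thread invokes a system call, after posting the request to a syscall page, instead of trapping into the kernel and blocking.

<pre>
/* Two user-mode threads multiplexed on one kernel-visible thread with
 * ucontext(3).  The swapcontext() calls stand in for the user-mode
 * switch FlexSC-Threads performs when a thread would otherwise block
 * in a system call.  Illustration only, not FlexSC-Threads code. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, t1_ctx, t2_ctx;

static void user_thread_1(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread 1: pretend to submit a syscall, then switch\n");
        swapcontext(&t1_ctx, &t2_ctx);   /* yield instead of blocking */
    }
    swapcontext(&t1_ctx, &main_ctx);
}

static void user_thread_2(void) {
    for (int i = 0; i < 3; i++) {
        printf("thread 2: running while thread 1's request is pending\n");
        swapcontext(&t2_ctx, &t1_ctx);
    }
    swapcontext(&t2_ctx, &main_ctx);
}

int main(void) {
    static char stack1[64 * 1024], stack2[64 * 1024];

    getcontext(&t1_ctx);
    t1_ctx.uc_stack.ss_sp = stack1;
    t1_ctx.uc_stack.ss_size = sizeof stack1;
    t1_ctx.uc_link = &main_ctx;
    makecontext(&t1_ctx, user_thread_1, 0);

    getcontext(&t2_ctx);
    t2_ctx.uc_stack.ss_sp = stack2;
    t2_ctx.uc_stack.ss_size = sizeof stack2;
    t2_ctx.uc_link = &main_ctx;
    makecontext(&t2_ctx, user_thread_2, 0);

    swapcontext(&main_ctx, &t1_ctx);
    return 0;
}
</pre>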

Overhead:

  - When running a single exception-less system call against a single synchronous system call, the exception-less call was slower.
  - When running a batch of exception-less system calls compared to a bunch of synchronous system calls, the exception-less system calls were much faster.
  - The same is true in a remote server situation: one synchronous call is much faster than one exception-less system call, but a batch of exception-less system calls is faster than the same number of synchronous system calls.

Related Work:

  - System Call Batching
       o Operating systems have a concept called multi-calls which involves collecting multiple system calls and submitting them as a single system call.
       o The Cassyopia compiler has an additional process called a looped multi-call where the result of one system call can be fed as an argument to another system call in the same multi-call.
       o Multi-calls do not provide parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do.
                 - Multi-call system calls are executed sequentially; each one must complete before the next may start (a toy illustration of this contrast follows this list).
  - Locality of Execution and Multicores
       o Other techniques include Soft Timers and Lazy Receiver Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to
limit processor interference associated with interrupt handling without affecting the latency of servicing requests.
       o Computation Spreading is another locality process which is similar to FlexSC.
                 - Processor modifications that allow hardware migration of threads and migration to specialized cores.
                 - Did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt.
       o Also have proposals for dedicating CPU cores to specific operating system functionality.
                 - These solutions require a microkernel system.
                 - FlexSC, by contrast, can dynamically adapt the proportion of cores used exclusively by the kernel versus cores shared by user and kernel execution.
  - Non-blocking Execution
       o Past research on improving system call performance has focused on blocking versus non-blocking behaviour.
                 - Typically researchers used threading, event-based (non-blocking) and hybrid systems to obtain high performance on server applications.
       o Main difference between past research and FlexSC is that none of the past proposals have decoupled system call execution from system call invocation.
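
For contrast with the exception-less approach, here is a toy rendering of the multi-call idea discussed above. The interface is hypothetical, not Cassyopia's or any real kernel's: it only shows that a multi-call hands over several requests at once but still executes them strictly one after another, unlike FlexSC's syscall threads, which can run submitted requests in parallel on other cores.

<pre>
/* Hypothetical multi-call: several system call requests handed over in
 * one batch, but executed strictly in order -- each call completes
 * before the next begins.  In a real multi-call the loop would run in
 * the kernel after a single mode switch; here it runs in user space
 * purely for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

struct call_req {
    long number;
    long args[3];
    long ret;
};

static void multi_call(struct call_req *reqs, int n) {
    for (int i = 0; i < n; i++)
        reqs[i].ret = syscall(reqs[i].number, reqs[i].args[0],
                              reqs[i].args[1], reqs[i].args[2]);
}

int main(void) {
    static const char a[] = "first write\n", b[] = "second write\n";
    struct call_req batch[] = {
        { SYS_write, { 1, (long)(uintptr_t)a, sizeof a - 1 }, 0 },
        { SYS_write, { 1, (long)(uintptr_t)b, sizeof b - 1 }, 0 },
    };
    multi_call(batch, 2);
    printf("returns: %ld, %ld\n", batch[0].ret, batch[1].ret);
    return 0;
}
</pre>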

--Mike Preston 04:03, 20 November 2010 (UTC)