COMP 3000 Essay 2 2010 Question 3

From Soma-notes

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls


Paper

The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [1] for further details on specifics of the essay.

Background Concepts:

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper.

System Call

A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]

Mode Switch

Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are swtiching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1]

Synchronous System Call

Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]

Asynchronous System Call

An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]

System Call Pollution

System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3]

Processor Exceptions

Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5]

System Call Batching

System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]

Temporal and Spatial Locality

Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]

Instructions Per Cycle (IPC)

Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9]

will add the following terms.
TODO-Start --Tafatah 16:13, 30 November 2010 (UTC)

TLB

Lack of Locality

Throughput

Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

Regular Store Instructions

Linux ABI

NPTL

Syscall Page

Syscall Threads

Inter-Process Interrupt

Latency

Producer-Consumer Problem

Note: explain the relationship

TODO End --Tafatah 16:13, 30 November 2010 (UTC)

Research Problem:

System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]

Contribution:

Exception-Less System Calls

Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:

1. System Call Batching:
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.

2. Core Specialization
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1]

3. Exception-less System Call Interface
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1]

4. Decoupling Execution from Invocation
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]

FlexSC Threads

As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]

Critique:

Moore's Law

One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by dispairity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance.

Performance of FlexSC

It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient.

Blocking Calls

FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.

Core Scheduling Issues

In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications.

When There Are Not More Threads Then Cores

In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

Not Done Yet

The following will be finalized on Wed. --Tafatah 02:26, 1 December 2010 (UTC)
START
Good
Most of the work is done transparently. i.e. there is no need for application
code modification.

Not for IO intensive tasks

Small kernel change (3 lines of code as per section 3.2). That means adopters, and
after each update of the kernel, would have to add/modify the referenced lines and
then recompile the kernel

For a multicore system (section 3, mostly 3.5), the Flex scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism.

Further, the paper mentions that a predefined, static list of cores is used for syscall threads assignments. It is unclear when that list is created. Is it at installation time,
is it generated initially, or does the installer have to do any manual work.

Scaling with increased cores is ambiguous
Further, it is not that clear how scalable the scheduler is. One gets the impression that it is very scalable due to the fact that each core
spawns a syscall thread. Thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit
results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results.
Understandable, however, it would be nice to know if these threads ( 2 per core ) would actually be treated as a core when turned on ? I.e. would
the scheduler then realize that it can use eight cores ? Does that also mean the predefined static cores list would have to be modified ?

No testing with GPU
Given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance
effects when using specialized GPUs, like NVIDIA's Tesla GPUs for example.
END

Related Work:

System Call Batching

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.

Locality of Execution and Multicores

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism.

Non-blocking Execution

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.



References:

[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.PDF

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.PDF

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.PDF

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.PDF

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.PDF

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. PDF

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.PDF

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.HTML

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.PDF