COMP 3000 Essay 2 2010 Question 3

3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

Paper

The Title of the paper we will be analyzing is named "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls". The authors of this paper consist of Livio Stores and Michael Stumm, both of which are from the University of Toronto. The paper can be viewed here, [1] for further details on specifics of the essay. It is essential for one to comprehend the basic concepts of the language that is spoken in the paper in order to fully understand the ideas that are being discussed. The most important notions discussed in the FlexSC paper that are at the core of it all, are System calls[21], and Sychronous systems. These base definitons along with numerous other helpful ideas can be understood through the section of the paper to follow.

Background Concepts:

In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper. It is more vital to the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the miniscule details of their processes.

System Call

A System Call is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel's services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]

Mode Switch

Mode Switches speak of moving from one medium to another. Specifically moving from the User Space mode to the Kernel mode or Kernel mode to User Space. It does not matter which direction or which modes we are switching from, this is simply a general term. Crucial to mode switching is the mode switch time which is the time necessary to execute a system call instruction in user-mode, perform the kernel mode execution of the system call, and finally return the execution back to user-mode.[1]

Synchronous System Call

Synchronous Execution Model(System call Interface) refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]

Asynchronous System Call

An asynchronous system call is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]

System Call Pollution

System Call Pollution is a more sophisticated manner of referring to wasteful or un-necessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch which is not a costless task. The "pollution" involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - table which reduces the frequency of main memory access for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3]

Pipeline Flushing

The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time. The parallel processing of instructions provides a significant speed advantage in processing. During a mode switch, however, instructions in the user-mode pipeline are flushed and removed from the processor registers.[1] These lost instructions are part of the cost of a mode switch.

Processor Exceptions

Processor exceptions are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5]

System Call Batching

System Call Batching is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]

Temporal and Spatial Locality

Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; spatial locality and temporal locality. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]

Instructions Per Cycle (IPC)

Instructions per cycle is the amount of instructions a processor can execute in a single clock cycle.[9]

Translation Look-Aside Buffer (TLB)

A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory.

The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]

Lack of Locality

As per paper, locality refers to both types of locality, i.e. temporal and spatial, defined above. Thus, lack of locality here means data and instructions needed most frequently by the application continues to be switched back and forth (from registers and caches) due to system calls, attributing hence, to performance degradation.

Throughput

Is an indication of how much work is done during a unit of time. E.g. n transactions per hour. The higher n is, the better. [2. P151]

Regular Store Instructions

A store instruction simply refers to a typical assembly language instruction, where, usually, there are two arguments. A value, and a memory location, where that value should be stored.

Linux Application Binary Interface (ABI)

The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]

Native POSIX Thread Library (NPTL)

NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]

Syscall Page

A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as syscall number, number of arguments, the arguments, status, and return value [1].

Syscall Threads

Syscall threads is FlexSC's method to allow exception-less system calls. A syscall thread shares its process virtual address space [1].

Latency

Latency is a measure of the time delay between the start of an action and its completion in a system.[20]

Research Problem:

System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy to program nature of having sequential operation. However, this ease of use also comes with undesireable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]

The negative effects of synchronous system calls have been researched heavily, it is accepted that although easy to use, they are not optimal. Previous research includes work into system call batching such as multi-calls[6], locality of execution with multicore systems[7][8], and non-blocking execution. System call batching shares great similarity with FlexSC as multiple system calls are grouped together to reduce the amount of mode switches required of the system.[6] The difference is multi-calls do not make use of parallel execution of system calls nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations as described in the Contribution section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel solution and although they can dedicate certain execution to specific cores of a system, they can not dynamically adapt to the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]

Contribution:

Exception-Less System Calls

Exception-less system calls are the research team's attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:

1. System Call Batching:
Instead of having each system call run as soon as it is called, FlexSC instead groups together system calls into batches. These batches can then be executed at one time thus minimizing the frequency of mode switches bewteen user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]

2. Core Specialization
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in Decoupling Execution from Invocation section below.[1]

3. Exception-less System Call Interface
To provide an asynchronous interface to the kernel, FlexSC uses syscall pages. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: Free - meaning a syscall can be added to the entry; Submitted - meaning the kernel can proceed to invoke the appropriate system call operations; and Done - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1]

4. Decoupling Execution from Invocation
In order to separate a system call invocation from the execution of the system call, syscall threads were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]

FlexSC Threads

As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accomodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]

FlexSC Threads implement an M-on-N threading model, where M is the number of user based threads and N is the number of syscall threads, M being greater than N [1]. This is similar to one of the common styles of implementing threads, referred to as M:N model [22], where M represents user space threads, and N represents Kernel threads. The M-on-N model however uses FlexSC’s exception-less system call mechanism (transparently), i.e. its inner workings differ, from the otherwise typical M:N model. The M-on-N model takes advantage of maximizing the number of user space threads switching. In other words, as long as a user space thread is ready, work will continue in user space, minimizing thus the time spent on blocking.

Implementation and Statistics

Up until this point, we have discussed and illustrated the theory behind FlexSC's logic patterns. However, it is also as important to demonstrate the practical and statistical results during implementation. FlexSC's implementation has shown a very large leap from today's synchronous system call method currently in use. Firstly, demonstration of the current synchronous systems statistics will be discussed. The data collected by the researchers show that the System Calls cause a large delay in the User IPC(Interprocess Communication) between 20% and 60%. To get a general idea of how many instructions are executed in one system call, we name a few with their corresponding number of instructions. According to the researchers of this paper, pread(), pwrite(), open+close() require 3739, 5689 and 6631 instructions respectively. Therefore, these system calls, are not minuscule lines of code, that can be overlooked, they consume a large amount of time, hence FlexSC's attempt to lower this overall value by methods mentioned above in the paper such as system call batching.

On a single core, the System-Less Batching produces an improved time from 55 nanoseconds to about 35 nanoseconds once 15 or more batch requests have been utilized. Moreover, the throughput(requests/sec) shows a dramatic increase of 10,000 and more in Apache throughput of Linux. This throughput increases with the more cores implemented in the system. This demonstrates the theory of how the FlexSC's main application improvements in systems implementing multiple cores.

Furthermore, on a 4 core system, 14% of the original 50% idle time is being used in the User Space now. Latency as well is decreased by half of the original time on 1,2 and 4 core implementations. Also very interesting test case the researchers investigated was the idea of having numerous requests occurring at once in the system. They chose to execute 256 synchronous requests in the system and what they came to find is that FlexSC's method is 60ms faster than the current synchronous method and continually increases with the more cores in use.

Through the different methods being executed by the FlexSC's concepts, the statistics taken of the system demonstrates that their theory does hold in practice. This illustrates that we can begin implementing this system into our computers we use every day and begin to see immediate results. Now this effect of the statistics occurs when multiple cores occur as well when batching occurs numerous times. Otherwise if these two factors are very small then no change occurs or even a decrease in efficiency can occur.

Critique:

Moore's Law

One interesting aspect of this paper is how the research relates to Moore's Law. Moore's Law states that the number of transistors on a chip doubles every 18 months.[10]. This has lead to very large increases in the performance potential of software but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by disparity of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls, but it is actually an attempt to change the way we view software. It is not simply enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain of power. Instead we need to focus on appropriate allocation and usage of the power as failure to do so is the origination of the gap between our potential and our performance.

Performance of FlexSC

It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be in turn batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper's focus is on server-like applications as server must handle many user requests efficiently to be useful. Thus, for a general case it appears that a hybrid solution of synchronous calls below some threshold and execption-less system calls above the same threshold would be most efficient.

Blocking Calls

FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can 'harvest' enough independent work so that it doesn't need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are 'inherently asynchronous', if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.

Core Scheduling Issues

In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC's scheduling algorithm would need to be fine-tuned for running multiple applications.

When There Are Not More Threads Then Cores

In situations where there is a single thread using 100% of a CPU, and acting primarily in user-space, such as 'Scientific Programs', FlexSC causes more overhead then performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.

IO

FlexSC is not suited for data intensive, IO centric applications, as realized by Vijay Vasudevan [16]. Vijay's research aims to reduce the energy footprint in data centers. FlexSC
was considered. It was found that FlexSC's reduction of mode switches, via the use of shared memory pages between user space and kernel space is useful for reducing the impact
of system calls. That technique however was not useful for IO intensive work since it did not remove the requirement of data copying and did not reduce the overheads associated
with interrupts in IO intensive tasks.

Some Kernel Changes Are Required

Though most of the work is done transparently. i.e. there is no need for application's code modification, there remains a need for small kernel change (3 lines of code), as per section 3.2 of the paper [1].
That means adopters, and after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel.

Multicore Systems

For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn't exactly clarify the mechanism.
Further, the paper mentions that a predefined, static list of cores is used for system call threads assignments. It is unclear when that list is created. Is it at installation time,
is it generated initially, or does the installer have to do any manual work. On a related note, scalability with increased cores is ambiguous. It is not that clear how scalable the
scheduler is. One gets the impression that it is very scalable due to the fact that each core spawns a system call thread. Thus, as many threads as there are cores could be running
concurrently, for one or more processes [1]. More explicit results however would've been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis
of the results. Understandable, however, it would be nice to know if these threads (2 per core) would actually be treated as a core when turned on ? I.e. would the scheduler then realize
that it can use eight cores ? Does that also mean the predefined static cores list would need to be modified, to list eight instead of four ?

Along the same reasoning, and given the growing popularity of GPU's use for general programming, it would've been useful to at-least hypothesize on the possible performance
outcome when using specialized GPUs, like NVIDIA's Tesla GPUs for example. Would FlexSC's scheduler be able to take advantage of the additional cores, and hence use them for
specialized purposes ?

Style

The overall reading of the paper was fairly understandable. Some deeper knowledge of System Calls and Synchronous systems is definitely recommended, almost necessary for this papers true comprehension. The paper was explained very well and even though we did need to research some specific background or similar topics, it was a well written paper. A reader with a base knowledge of Operation Systems can understand this paper for the core concept and know where to go to investigate further details and descriptions that are lacking in the paper.

Related Work:

System Call Batching

Muti-calls is a concept which involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique name a looped multi-call, which is an additional process where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls. Multi-call system calls are executed sequentially, each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.

Locality of Execution and Multicores

Several techniques addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver[14] Processing which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique name Computation Spreading[15] is most similar to the multicore execution of FlexSC. Processor modifications that allow hardware migration of threads and migration to specialized cores. However, they did not model TLBs on current hardware synchronous thread migration is a costly interprocessor interrupt. Another solution has 2 difference between FlexSC. They require a micro-kernel. Also FlexSC can dynamically adapt the proportion of cores used by the kernel or cores shared by user and kernel execution dynamically. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient, and flexible mechanism.

Non-blocking Execution

Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically researchers used threading, event-based, which is non-blocking and hybrid systems to obtain high performance on server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.

References:

[1] Soares, Livio and Michael Stumm, FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, University of Toronto, 2010.PDF

[2] Tanenbaum, Andrew S., Modern Operating Systems: 3rd Edition, Pearson/Prentice Hall, New Jersey, 2008.

[3] Stallings, William, Operating Systems: Internals and Design Principles - 6th Edition, Pearson/Prentice Hall, New Jersey, 2009.

[4] Garfinkel, Tim, Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools, Computer Science Department - Stanford University.PDF

[5] Yoo, Sunjoo et al., Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design, SLS Group, TIMA Laboratory, Grenoble, 2002.PDF

[6] Rajagopalan, Mohan et al., Cassyopia: Compiler Assisted System Optimization, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.PDF

[7] Kumar, Sanjeev and Christopher Wilkerson, Exploiting Spatial Locality in Data Caches using Spatial Footprints, Princeton University and Microcomputer Research Labs (Oregon), 1998.PDF

[8] Jin, Shudong and Azer Bestavros, Sources and Characteristics of Web Temporal Locality, Computer Science Depratment - Boston University, Boston. PDF

[9] Agarwal, Vikas et al., Clock Rate versus IPS: The End of the Road for Conventional Microarhitechtures, University of Texas, Austin, 2000.PDF

[10] Tuomi, Ilkka, The Lives and Death of Moore's Law, 2002.HTML

[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.PDF

[17] Patricia J. Teller Translation-Lookaside Buffer Consistency, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. HTML

[18] Linux ABI sourceforge page. HTML and Linux application page. HTML

[19] DREPPER, U., AND MOLNAR , I. The Native POSIX Thread Library for Linux. Tech. rep., RedHat Inc, 2003. HTML

[20] M. Brian Blake, Coordinating Multiple Agents for Workflow-Oriented Process Orchestration. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. PDF

[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.HTML

[22] McCracken, Dave. POSIX Threads and the Linux Kernel, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. PDF