<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tafatah</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tafatah"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Tafatah"/>
	<updated>2026-05-01T15:25:17Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6631</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6631"/>
		<updated>2010-12-03T01:58:35Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Implementation and Statistics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details. In order to fully understand the ideas being discussed, it is essential to comprehend the basic vocabulary used in the paper. The most important notions discussed in the FlexSC paper, at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence system calls act as the messengers between user space and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; is the transition from one processor mode to another, specifically from user mode to kernel mode or from kernel mode back to user mode; the term is general and does not imply a direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to issue a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data overwritten in critical processor structures such as the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time. The parallel processing of instructions provides a significant speed advantage. During a mode switch, however, instructions in the user-mode pipeline are flushed and removed from the processor registers.[1] These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality thus means that the data and instructions needed most frequently by the application keep being evicted (from registers and caches) due to system calls, thereby contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction is a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
In general, an application binary interface defines the low-level binary interface between an application and the operating system. The Linux ABI project, specifically, is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as syscall number, number of arguments,&lt;br /&gt;
the arguments, status, and return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
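As a rough illustration of the sysentry fields listed above (syscall number, number of arguments, the arguments, status, and return value), the following Python sketch models one entry. The field names, states, and numeric values are illustrative assumptions, not the actual 64-byte kernel layout. &lt;br /&gt;

```python
from dataclasses import dataclass

# States a syscall entry can be in, per the paper's description.
FREE, SUBMITTED, DONE = "free", "submitted", "done"

@dataclass
class SysEntry:
    # Field names are illustrative; the real sysentry is a 64-byte struct.
    syscall_number: int = 0
    num_args: int = 0
    args: tuple = ()
    status: str = FREE
    return_value: int = 0

# A user-mode thread claims a free entry, fills it in, and submits it:
entry = SysEntry()
entry.syscall_number = 0          # hypothetical syscall number
entry.args = (3, 4096, 512)       # hypothetical arguments
entry.num_args = 3
entry.status = SUBMITTED
```

Submitting is just a regular store to shared memory, which is why no processor exception is raised at this point. &lt;br /&gt;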
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads is FlexSC&#039;s method to allow exception-less system calls. A syscall thread shares its process virtual address space [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the requested call will then run once the batch condition is met, and its return value will be stored back in the entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
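To make the decoupling concrete, here is a small Python sketch (an illustration only, not FlexSC code) of the loop a syscall thread could run: scan the shared page for submitted entries, execute each one, and publish the return value, with no processor exception involved. The dispatch function here is a stand-in that simply sums the arguments. &lt;br /&gt;

```python
# Entry states on the shared syscall page (illustrative values).
FREE, SUBMITTED, DONE = 0, 1, 2

def run_syscall(number, args):
    # Stand-in for real kernel dispatch; here it just sums the arguments.
    return sum(args)

def syscall_thread(page):
    """Drain the shared syscall page: execute every submitted entry."""
    for entry in page:
        if entry["status"] == SUBMITTED:
            entry["retval"] = run_syscall(entry["nr"], entry["args"])
            entry["status"] = DONE

# One submitted entry and one free entry on the shared page:
page = [
    {"nr": 0, "args": (1, 2), "status": SUBMITTED, "retval": None},
    {"nr": 1, "args": (), "status": FREE, "retval": None},
]
syscall_thread(page)
```

Because this loop can run on any core, invocation and execution need not share a core, which is what enables the core specialization described above. &lt;br /&gt;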
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implement an M-on-N threading model, where M is the number of user-based threads and N is the number of syscall threads, with M greater than N [1]. This is similar to a common style of implementing threads referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC&#039;s exception-less system call mechanism (transparently), i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model takes advantage of user-space thread switching: as long as a user-space thread is ready, work continues in user space, thus minimizing the time spent blocking.&lt;br /&gt;
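The scheduling idea behind the M-on-N model can be sketched in Python with generators standing in for user threads: whenever a thread submits a call it yields, and the library keeps running whichever thread is ready instead of blocking on the call. This is a toy illustration of the idea only, not the FlexSC-Threads implementation. &lt;br /&gt;

```python
from collections import deque

def user_thread(name, n_calls):
    # Each yield models "submit an exception-less syscall, then yield the CPU".
    for i in range(n_calls):
        yield (name, i)

def schedule(threads):
    """Run ready user threads round-robin instead of blocking on each call."""
    order = []
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            order.append(next(t))
            ready.append(t)      # thread stays ready; no blocking wait
        except StopIteration:
            pass                 # thread finished
    return order

trace = schedule([user_thread("A", 2), user_thread("B", 2)])
# The two threads interleave: work continues whenever any thread is ready.
```

In the real system the yielded requests would accumulate on a syscall page and be executed in kernel mode by syscall threads, possibly on other cores. &lt;br /&gt;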
&lt;br /&gt;
===Implementation and Statistics===&lt;br /&gt;
Up until this point, we have discussed and illustrated the theory behind FlexSC&#039;s design. However, it is just as important to demonstrate the practical and statistical results of the implementation. FlexSC&#039;s implementation has shown a very large leap from the synchronous system call method currently in use. &lt;br /&gt;
Firstly, the statistics of the current synchronous systems will be discussed. The data collected by the researchers show that system calls cause a large degradation in user-mode IPC (instructions per cycle) of between 20% and 60%. To get a general idea of how many instructions are executed in one system call, we name a few with their corresponding number of instructions. According to the researchers of this paper, pread(), pwrite(), and open+close() require 3739, 5689 and 6631 instructions respectively. These system calls are therefore not minuscule lines of code that can be overlooked; they consume a large amount of time, hence FlexSC&#039;s attempt to lower this overall cost via the methods mentioned above, such as system call batching.  &lt;br /&gt;
&lt;br /&gt;
On a single core, exception-less system call batching improves the time per call from about 55 nanoseconds to about 35 nanoseconds once 15 or more batched requests are used. Moreover, the throughput (requests/sec) shows a dramatic increase of 10,000 or more in Apache throughput on Linux. This throughput increases with the number of cores in the system, demonstrating that FlexSC&#039;s main improvements apply to systems with multiple cores.   &lt;br /&gt;
&lt;br /&gt;
Furthermore, on a 4-core system, 14% of the original 50% idle time is now being used in user space. Latency as well is decreased to half of the original time on 1, 2 and 4 core implementations. Another very interesting test case the researchers investigated was the idea of having numerous requests occurring at once in the system. They chose to execute 256 synchronous requests in the system, and they found that FlexSC&#039;s method is 60ms faster than the current synchronous method, with the gap continuing to increase with the number of cores in use.   &lt;br /&gt;
&lt;br /&gt;
Through the different methods executed by FlexSC, the statistics taken of the system demonstrate that the theory does hold in practice. This illustrates that we can begin implementing this system in the computers we use every day and begin to see immediate results.&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to simply continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU, acting primarily in user space, such as with &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data intensive, IO centric applications, as realized by Vijay Vasudevan [16]. Vijay&#039;s research aims to reduce the energy footprint in data centers. FlexSC&amp;lt;br&amp;gt;&lt;br /&gt;
was considered. It was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space is useful for reducing the impact&amp;lt;br&amp;gt;&lt;br /&gt;
of system calls. That technique however was not useful for IO intensive work since it did not remove the requirement of data copying and did not reduce the overheads associated&amp;lt;br&amp;gt;&lt;br /&gt;
with interrupts in IO intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add/modify the referenced lines and then recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic&amp;lt;br&amp;gt;&lt;br /&gt;
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn&#039;t exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it at installation time,&amp;lt;br&amp;gt;&lt;br /&gt;
is it generated initially, or does the installer have to do manual work? On a related note, scalability with increased cores is ambiguous. It is not clear how scalable the&amp;lt;br&amp;gt;&lt;br /&gt;
scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread; thus, as many threads as there are cores could be running&amp;lt;br&amp;gt;&lt;br /&gt;
concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis&amp;lt;br&amp;gt;&lt;br /&gt;
of the results. That is understandable; however, it would be nice to know whether these hardware threads (2 per core) would actually be treated as cores when turned on, i.e. would the scheduler then realize&amp;lt;br&amp;gt;&lt;br /&gt;
that it can use eight cores? Does that also mean the predefined static core list would need to be modified to list eight instead of four?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same line of reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance&amp;lt;br&amp;gt;&lt;br /&gt;
outcome when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for&amp;lt;br&amp;gt;&lt;br /&gt;
specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, an additional mechanism where the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which try to tackle the issue of locality of execution by handling device interrupts. They both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC. It relies on processor modifications that allow hardware migration of threads and migration to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and unlike FlexSC they cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] IBM DeveloperWorks, &amp;lt;i&amp;gt;Kernel Command using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;br /&gt;
&lt;br /&gt;
[22] McCracken, Dave. &amp;lt;i&amp;gt;POSIX Threads and the Linux Kernel&amp;lt;/i&amp;gt;, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8887&amp;amp;rep=rep1&amp;amp;type=pdf#page=330 PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6477</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6477"/>
		<updated>2010-12-02T18:37:17Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* FlexSC Threads */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;, written by Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details. To fully understand the ideas being discussed, it is essential to comprehend the basic vocabulary used in the paper. The most important notions in the FlexSC paper, at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the sections that follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more important for the reader to understand the core ideas behind these definitions, along with the underlying motivation for their existence, than the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
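As a minimal concrete example, assuming a POSIX system with Python available, the os module exposes thin wrappers over system calls; the call below crosses from user space into the kernel and back:

```python
import os

# os.write wraps the write(2) system call: the file descriptor and the
# bytes are handed across the user/kernel boundary, the kernel performs
# the privileged work, and the byte count is returned to user space.
n = os.write(1, b"hello from a system call\n")  # fd 1 is standard output
```

Every such invocation is a synchronous system call in the paper's terminology: the calling process traps into the kernel and waits for the result before continuing.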
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving between execution modes: specifically, from User Space mode to Kernel mode, or from Kernel mode back to User Space. The term is general and does not imply a particular direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user-mode, perform the kernel-mode execution of the system call, and finally return execution to user-mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time.  The parallel processing of instructions provides a significant speed advantage in processing.  During a mode switch, however, instructions in the  user-mode pipeline are flushed and removed from the processor registers.[1]  These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
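A toy cost model, using made-up cycle counts rather than measured numbers, shows why batching pays off: one mode-switch round trip is amortized over the whole batch instead of being paid once per call.

```python
# Hypothetical costs for illustration only; real numbers depend on the
# processor and on the indirect (cache/TLB pollution) effects.
SWITCH_COST = 100   # cycles per user/kernel mode-switch round trip
WORK_COST = 20      # cycles of useful kernel work per call

def synchronous_cost(num_calls):
    # Each call pays its own mode-switch round trip.
    return num_calls * (SWITCH_COST + WORK_COST)

def batched_cost(num_calls):
    # One mode-switch round trip amortized over the whole batch.
    return SWITCH_COST + num_calls * WORK_COST

print(synchronous_cost(32))  # 3840
print(batched_cost(32))      # 740
```

With 32 calls the per-call switches dominate the synchronous total; that is the direct cost batching removes, before counting the indirect pollution savings.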
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality thus means that the data and instructions the application needs most frequently keep being displaced (from registers and caches) due to system calls, thereby contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as syscall number, number of arguments,&lt;br /&gt;
the arguments, status, and return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of the process it serves [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls is their ease of programming, which stems from their sequential operation. However, this ease of use comes with undesirable side effects that can reduce the processor&#039;s instructions per cycle (IPC).[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls make no use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking), and hybrid solutions. FlexSC, in contrast, provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode-switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may place a system call request in a free entry of a syscall page; the call is executed once the batch condition is met, and the return value is stored back in the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
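The interaction of syscall pages, entry states, and syscall threads described in points 3 and 4 can be sketched in Python. This is an illustrative model only; the real sysentry is a 64-byte shared-memory structure and the kernel side runs as in-kernel syscall threads, not a Python function:

```python
# Sketch of a syscall page as a table of entries with the three states
# described in the paper (free, submitted, done). Field names are
# illustrative, not taken from the FlexSC sources.
FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SysEntry:
    def __init__(self):
        self.state = FREE
        self.number = None      # syscall number
        self.args = None        # argument list
        self.ret = None         # return value, filled in by the kernel

def submit(page, number, args):
    """User side: claim a free entry and mark it submitted (no trap)."""
    for entry in page:
        if entry.state == FREE:
            entry.number, entry.args = number, args
            entry.state = SUBMITTED
            return entry
    return None  # page full; caller must wait or use another page

def kernel_pass(page, dispatch):
    """Kernel side (a stand-in for a syscall thread): execute every
    submitted entry and mark it done."""
    for entry in page:
        if entry.state == SUBMITTED:
            entry.ret = dispatch(entry.number, entry.args)
            entry.state = DONE

page = [SysEntry() for _ in range(8)]
e = submit(page, 1, ["hello"])             # request syscall number 1
kernel_pass(page, lambda n, a: len(a[0]))  # toy dispatcher
```

After kernel_pass runs, the user-mode side finds e.state equal to DONE and reads e.ret, without ever having taken a processor exception.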
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implement an M-on-N threading model, where M is the number of user-level threads and N is the number of syscall threads, with M greater than N [1]. This is similar to one of the common styles of implementing threads, referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC&#039;s exception-less system call mechanism (transparently), i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model maximizes the time spent executing user-space threads: as long as a user-space thread is ready, work continues in user space, thus minimizing the time spent blocking.&lt;br /&gt;
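The M-on-N idea can be sketched with ordinary Python threads: M user-level requests are multiplexed onto N workers through a shared queue. This only illustrates the multiplexing; FlexSC's real mechanism uses syscall pages and kernel-side syscall threads rather than a user-space queue:

```python
# M requests multiplexed onto N worker threads via a shared queue.
import queue
import threading

M, N = 12, 3
tasks = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:
            return              # shutdown marker
        results.put(item * item)  # stand-in for executing one request

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads:
    t.start()
for i in range(M):
    tasks.put(i)                # M pieces of user-level work
for _ in range(N):
    tasks.put(None)             # one shutdown marker per worker
for t in threads:
    t.join()

out = sorted(results.get() for _ in range(M))
```

No request ever owns a worker for longer than one item of work, which mirrors how a ready user-space thread can always make progress while a few syscall threads drain the submitted calls.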
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. The paper claims that this gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, caches, and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not speed up (become more efficient) along with the gain in power. Instead, we need to focus on the appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this, the simplest one, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
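The simple policy described above, try the highest-numbered syscall core first and walk downward while cores are busy, can be sketched as follows (the busy set is an invented stand-in for real core state):

```python
# Toy version of the "core X, then X-1, ..." selection rule.
def pick_core(num_cores, busy):
    """Return the highest-numbered free core, or None if all are busy."""
    for core in reversed(range(num_cores)):
        if core not in busy:
            return core
    return None

busy = {3}                      # core 3 already runs a syscall thread
print(pick_core(4, busy))       # 2: core 3 is busy, so try core 2
print(pick_core(4, set()))      # 3: the highest core is free
```

The appeal of the rule is that syscall work concentrates on as few cores as possible, leaving the remaining cores free for user-mode execution.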
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16]. Vijay&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it did not remove the requirement of data copying and did not reduce the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done: the paper says decisions are made based on workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignment, but it is unclear when that list is created: at installation time, generated at startup, or through manual work on the installer&#039;s part. On a related note, scalability with increasing core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, and thus as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (two per core) would be treated as cores when turned on, i.e. whether the scheduler would then realize it can use eight cores, and whether the predefined static core list would need to be modified to list eight instead of four.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize about the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. The system calls in a multi-call are executed sequentially: each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one call blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
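As a rough sketch of the looped multi-call idea (all names here are illustrative, not the actual Cassyopia interface), the result of one request can feed the next within a single submission:

```python
# Conceptual sketch of a multi-call: several system call requests are
# collected and submitted together, crossing the user/kernel boundary
# only once. A "looped" multi-call may feed the previous result into
# the next request. All names here are illustrative.

def multicall(requests, kernel_ops):
    """Run a batch of (op_name, args) requests sequentially, as a
    single combined kernel entry would."""
    results = []
    prev = None
    for op_name, args in requests:
        args = [prev if a == "PREV_RESULT" else a for a in args]
        prev = kernel_ops[op_name](*args)
        results.append(prev)
    return results

# Toy "kernel": open returns a file descriptor, read uses it.
ops = {
    "open": lambda path: 3,                      # pretend fd 3
    "read": lambda fd: "data from fd %d" % fd,   # pretend read
}

results = multicall([("open", ["/tmp/f"]), ("read", ["PREV_RESULT"])], ops)
```

A real multi-call is marshalled into one trap instruction; the point of the sketch is only that the calls run back-to-back, with no parallelism and no way to skip past a blocked call.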
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without increasing the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Another class of solutions differs from FlexSC in two ways: it requires a micro-kernel, and, unlike FlexSC, it cannot dynamically adapt the proportion of cores used exclusively by the kernel or shared between user and kernel execution. While all of these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system call mechanisms have decoupled system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;br /&gt;
&lt;br /&gt;
[22] McCracken, Dave. &amp;lt;i&amp;gt;POSIX Threads and the Linux Kernel&amp;lt;/i&amp;gt;, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8887&amp;amp;rep=rep1&amp;amp;type=pdf#page=330 PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6473</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6473"/>
		<updated>2010-12-02T18:33:40Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Background Concepts: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;, by Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details. To fully understand the ideas being discussed, it is essential to comprehend the basic vocabulary used in the paper. The most important notions in the FlexSC paper, at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the section that follows. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more important for the reader to understand the core ideas behind these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; refers to moving from one processor mode to another: specifically, from user mode to kernel mode, or from kernel mode back to user mode. The term is general and applies to either direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a term for the wasteful or unnecessary delay caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
In the regular operation of a CPU, multiple instructions are fetched, decoded and executed at the same time. This parallel processing of instructions provides a significant speed advantage. During a mode switch, however, instructions in the user-mode pipeline are flushed and discarded.[1] These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
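A back-of-the-envelope sketch of why batching helps (the switch cost of 2 per call is an arbitrary stand-in, not a measured number): each synchronous call pays a user-to-kernel and a kernel-to-user switch, while a batch pays that round trip once.

```python
# Count mode switches only; no real system calls are made.
# switch_cost = 2 models one entry into the kernel plus one return.

def synchronous_switches(n_calls, switch_cost=2):
    # every call enters and leaves the kernel on its own
    return n_calls * switch_cost

def batched_switches(n_calls, switch_cost=2):
    # the whole batch shares a single entry/return round trip
    return switch_cost

saved = synchronous_switches(32) - batched_switches(32)
```

For 32 calls, the batch saves 62 of the 64 switches in this toy model; the indirect savings (less cache and TLB pollution) come on top of that.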
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the tendency, during execution, for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As used in the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality here means that the data and instructions needed most frequently by the application keep being evicted from registers and caches due to system calls, thus contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction is a typical assembly-language instruction which usually takes two arguments: a value, and the memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as syscall number, number of arguments,&lt;br /&gt;
the arguments, status, and return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
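The 64-byte size can be made concrete with a sketch. The paper fixes the size and the kinds of fields; the exact packing below is only one plausible layout, not the actual FlexSC structure:

```python
import ctypes

# One plausible 64-byte packing of a syscall entry. The field names and
# widths are assumptions; only the 64-byte total and the kinds of fields
# (syscall number, argument count, arguments, status, return value) come
# from the paper.
class SyscallEntry(ctypes.Structure):
    _fields_ = [
        ("syscall_nr", ctypes.c_uint32),      # which system call to run
        ("nargs",      ctypes.c_uint32),      # how many arguments are valid
        ("status",     ctypes.c_uint32),      # free / submitted / done
        ("_pad",       ctypes.c_uint32),      # padding to keep args aligned
        ("args",       ctypes.c_uint64 * 6),  # six 64-bit argument slots; the
                                              # return value can reuse a slot
    ]

entry_size = ctypes.sizeof(SyscallEntry)
```

The 64-byte size is convenient because it matches a typical cache line, so one entry can be exchanged between cores without false sharing.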
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of its process [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to use for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions. FlexSC, in contrast, provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first queueing as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make requests (system calls) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the kernel then executes the call once the batch condition is met and stores the return value in the entry. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
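The entry life cycle described in points 3 and 4 above can be modelled in a few lines (a toy simulation, not kernel code): user space flips an entry from Free to Submitted with plain memory writes, and a syscall thread later executes it and marks it Done.

```python
# Toy model of a syscall page whose entries cycle through the states
# FREE, then SUBMITTED, then DONE. No traps are involved: both sides
# communicate through ordinary reads and writes of shared entries.
FREE, SUBMITTED, DONE = range(3)

class Entry:
    def __init__(self):
        self.status = FREE
        self.syscall = None
        self.retval = None

def user_submit(page, syscall):
    """User side: claim a FREE entry and mark it SUBMITTED."""
    for e in page:
        if e.status == FREE:
            e.syscall = syscall
            e.status = SUBMITTED
            return e
    return None  # page full; the caller must wait

def syscall_thread_run(page, handlers):
    """Kernel side: execute every SUBMITTED entry and mark it DONE."""
    for e in page:
        if e.status == SUBMITTED:
            e.retval = handlers[e.syscall]()
            e.status = DONE

page = [Entry() for _ in range(4)]
e = user_submit(page, "getpid")
syscall_thread_run(page, {"getpid": lambda: 1234})
```

Because both sides only read and write the shared entries, submission and completion never require a processor exception, which is the whole point of the design.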
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implements an M-on-N threading model, where M is the number of user-space threads and N is the number of syscall threads, with M greater than N [1]. This is similar to a common style of implementing threads referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC&#039;s exception-less system call mechanism transparently, i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model takes advantage of switching among user-space threads: as long as some user-space thread is ready, work continues in user space, thus minimizing the time spent blocked.&lt;br /&gt;
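The benefit of decoupling invocation from execution can be imitated with ordinary threads (a toy analogy in user space; FlexSC does this with syscall threads inside the kernel, not with a Python queue): the requesting thread keeps making progress while its request is serviced elsewhere.

```python
import threading
import queue

# "User" code posts a request and keeps working; a separate worker
# plays the role of a syscall thread and executes it independently.
requests = queue.Queue()
results = {}

def worker_loop():
    while True:
        rid, fn = requests.get()
        if fn is None:          # shutdown sentinel
            break
        results[rid] = fn()     # execution decoupled from invocation

worker = threading.Thread(target=worker_loop)
worker.start()

requests.put((1, lambda: 40 + 2))   # invocation: returns immediately
user_progress = sum(range(10))      # user-mode work continues meanwhile
requests.put((None, None))
worker.join()                       # all submitted work has completed
```

The analogy is loose: FlexSC gets this effect without crossing the user/kernel boundary per request, whereas a real queue-and-thread design still pays synchronization costs.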
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this sense, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this simplest algorithm turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
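The wake-up rule described above (try core X, then core X-1, and so on) is short enough to state directly; the function below is a sketch of that rule, not code from the paper:

```python
# Walk downward from the highest-numbered core until an idle one is
# found; this mirrors the simple policy that performed best in the paper.

def pick_syscall_core(busy_cores, num_cores):
    """Return the highest-numbered idle core, or None if all are busy."""
    for core in range(num_cores - 1, -1, -1):
        if core not in busy_cores:
            return core
    return None
```

On a four-core machine, the rule always prefers core 3, falling back to 2, 1, then 0; a multi-application deployment would presumably need a less greedy policy.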
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as noted by Vijay Vasudevan [16]. Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered in it. It was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) remains necessary, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and then recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how this dynamic allocation is done; it is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignments. It is unclear when that list is created: at installation time, generated at startup, or through manual work by the installer? On a related note, scalability with increased core counts is ambiguous. One gets the impression that the scheduler is very scalable, since each core spawns a syscall thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be useful to know whether these hardware threads (2 per core) would be treated as cores when turned on. That is, would the scheduler then see eight cores, and would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same line of reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept that involves collecting multiple system calls and submitting them as a single system call; they are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, an additional mechanism whereby the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not explore parallel execution of system calls, nor do they address the blocking of system calls the way exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts: both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC; it proposes processor modifications that allow hardware migration of threads to specialized cores. However, its authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Related solutions differ from FlexSC in two further ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel versus cores shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between these proposals for non-blocking execution and FlexSC is that none of the non-blocking system call designs decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I. and Warfield, A., &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and M. Parkes, &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and P. Druschel, &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and G. Banga, &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., P. M. Wells and G. S. Sohi, &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;br /&gt;
&lt;br /&gt;
[22] McCracken, Dave. &amp;lt;i&amp;gt;POSIX Threads and the Linux Kernel&amp;lt;/i&amp;gt;, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8887&amp;amp;rep=rep1&amp;amp;type=pdf#page=330 PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6470</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6470"/>
		<updated>2010-12-02T18:28:25Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details. It is essential to comprehend the basic vocabulary used in the paper in order to fully understand the ideas being discussed. The most important notions in the FlexSC paper, which are at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, can be understood through the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security); hence system calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one processor mode to another: specifically, from User Space mode to Kernel mode, or from Kernel mode back to User Space. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
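As a rough illustration of this cost, one can time a loop of a trivial system call from user space against a loop that never leaves user mode. This is only a hint at the order of magnitude; absolute numbers depend entirely on the hardware, the kernel, and (here) Python interpreter overhead:

```python
# Rough illustration of mode-switch cost: time a tight loop of a trivial
# system call (os.getpid wraps getpid(2) and enters the kernel each time)
# against a plain Python no-op call that stays entirely in user mode.
import os
import time

def time_loop(fn, n=100_000):
    """Return the wall-clock time for n calls of fn."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

def measure():
    syscall_time = time_loop(os.getpid)      # two mode switches per iteration
    user_time = time_loop(lambda: None)      # no mode switches at all
    return syscall_time, user_time
```

Comparing the two durations gives a crude per-call estimate of the direct cost of entering and leaving the kernel; the indirect costs (cache and TLB pollution) are not visible in such a micro-benchmark.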
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process that initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to the wasteful or unnecessary delay in the system caused by system calls. This pollution stems from the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data overwritten in critical processor structures such as the TLB (translation look-aside buffer, the table that reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time.  The parallel processing of instructions provides a significant speed advantage in processing.  During a mode switch, however, instructions in the  user-mode pipeline are flushed and removed from the processor registers.[1]  These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
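The saving can be expressed with simple arithmetic: n synchronous calls cost n user/kernel round trips, whereas batches of size b cost only the ceiling of n/b round trips. A minimal sketch (the batch size is an illustrative parameter, not a value from the paper):

```python
# Toy model of the mode-switch saving from batching: each synchronous call
# costs one user/kernel round trip, while a batch pays for a single round
# trip no matter how many calls it carries.
def mode_switches_sync(num_calls):
    """One round trip per call under the synchronous model."""
    return num_calls

def mode_switches_batched(num_calls, batch_size):
    """One round trip per full or partial batch (ceiling division)."""
    return -(-num_calls // batch_size)
```

For example, 32 calls in batches of 8 need only 4 round trips instead of 32; the indirect benefit of less processor-state pollution comes on top of this.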
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, lack of locality here means that the data and instructions needed most frequently by the application keep being evicted (from registers and caches) due to system calls, thereby contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry (sysentry) is a 64-byte data structure which includes information such as the syscall number, the number of arguments, the arguments themselves, a status field, and the return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of the process on whose behalf it runs [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to adopt.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls make no use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions; FlexSC, in contrast, provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides to synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is invoked, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the request will then be executed once the batch condition is met, and the return value stored back in the entry. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and these entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution is completed on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
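The syscall-page protocol of points 3 and 4 above can be pictured with a small model. The following Python sketch is purely illustrative: the class and field names, the dictionary representation, and the table size are our own, and the real entries are 64-byte in-kernel structures shared with user space:

```python
# Minimal model of the syscall-page protocol: a table of entries, each
# cycling Free, then Submitted, then Done, then Free again. The user side
# submits and collects; the kernel side executes every submitted entry.
FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallPage:
    def __init__(self, num_entries=8):
        self.entries = [{"state": FREE, "call": None, "ret": None}
                        for _ in range(num_entries)]

    def submit(self, call):
        """User side: claim a free entry, mark it submitted, return its index."""
        for i, entry in enumerate(self.entries):
            if entry["state"] == FREE:
                entry.update(state=SUBMITTED, call=call, ret=None)
                return i
        raise RuntimeError("no free syscall entries")

    def execute_all(self, handler):
        """Kernel side (the syscall thread): run each submitted entry."""
        for entry in self.entries:
            if entry["state"] == SUBMITTED:
                entry["ret"] = handler(entry["call"])
                entry["state"] = DONE

    def collect(self, idx):
        """User side: retrieve the result and free the entry for reuse."""
        entry = self.entries[idx]
        assert entry["state"] == DONE
        ret = entry["ret"]
        entry.update(state=FREE, call=None, ret=None)
        return ret
```

Note that neither `submit` nor `collect` does anything resembling a mode switch; only `execute_all`, the stand-in for the kernel-side syscall thread, touches the submitted work, which mirrors how the real interface avoids processor exceptions on the user side.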
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface, compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implements an M-on-N threading model, where M is the number of user-based threads and N is the number of syscall threads, M being greater than N [1]. This is similar to one of the common styles of implementing threads, referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC’s exception-less system call mechanism transparently, i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model maximizes the time spent running user-space threads: as long as a user-space thread is ready, work continues in user space, thus minimizing the time spent blocked.&lt;br /&gt;
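The decoupling of invocation from execution that underlies this model can be sketched with an ordinary Python thread standing in for a kernel-mode syscall thread. The queue, the request format, and the result strings are illustrative assumptions, not FlexSC&#039;s actual interface:

```python
# Sketch of decoupled invocation and execution: the "user" side posts a
# request into a shared queue and keeps doing user-mode work, while a
# separate thread (a stand-in for a FlexSC syscall thread, possibly running
# on another core) executes the request.
import queue
import threading

requests = queue.Queue()
results = {}

def syscall_thread():
    """Stand-in for FlexSC's kernel-side syscall thread."""
    while True:
        item = requests.get()
        if item is None:              # shutdown sentinel
            break
        req_id, call = item
        results[req_id] = f"result-of-{call}"

def run_demo():
    worker = threading.Thread(target=syscall_thread)
    worker.start()
    requests.put((1, "read"))         # invocation returns immediately...
    user_work = sum(range(100))       # ...so user-mode work continues here
    requests.put(None)
    worker.join()                     # wait until the request has executed
    return user_work, results
```

The point of the sketch is only the ordering: the submitting side never blocks on the request it posted, which is exactly what lets an M-on-N runtime keep ready user threads running while syscall threads work through the submitted entries.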
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC jumps to the next system call and executes that one instead. This can cause a problem for system calls such as reads and writes, where a write call has an outstanding dependency on a read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently support such a mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it tries core X-1, and so on. Of all the algorithms they tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered for this purpose. It was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines, and then recompile the kernel, after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler attempts to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done. The paper mentions that decisions are based on workload requirements, which does not clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignment. It is unclear when that list is created: at installation time, on startup, or through manual work by the installer? On a related note, scalability with increasing core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many syscall threads as there are cores could run concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be useful to know whether these hardware threads (two per core) would be treated as cores when turned on. Would the scheduler then see eight cores, and would the predefined static core list need to list eight instead of four?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize about the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a technique in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. Multi-call system calls execute sequentially; each must complete before the next may start. Exception-less system calls, on the other hand, can execute in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution through the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other related solutions differ from FlexSC in two ways: they require a micro-kernel, whereas FlexSC does not, and unlike them FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel versus cores shared by user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system call mechanisms decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;br /&gt;
&lt;br /&gt;
[22] McCracken, Dave. &amp;lt;i&amp;gt;POSIX Threads and the Linux Kernel&amp;lt;/i&amp;gt;, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8887&amp;amp;rep=rep1&amp;amp;type=pdf#page=330 PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6468</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6468"/>
		<updated>2010-12-02T18:27:35Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details on its specifics. To fully understand the ideas being discussed, it is essential to comprehend the basic terminology used in the paper. The most important notions at the core of the FlexSC paper are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more important for the reader to grasp the core ideas behind these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence, system calls are the messengers between user space and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another, specifically from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time.  The parallel processing of instructions provides a significant speed advantage in processing.  During a mode switch, however, instructions in the  user-mode pipeline are flushed and removed from the processor registers.[1]  These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality here means that the data and instructions needed most frequently by the application keep getting evicted (from registers and caches) due to system calls, thus contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly language instruction that usually takes two arguments: a value, and a memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry (sysentry) is a 64-byte data structure which includes information such as the syscall number, the number of arguments, the arguments themselves, a status field, and the return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of its process [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use comes with undesirable side effects which can reduce the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although they are easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the system call will then run once the batch condition is met, and the kernel will store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value generates a processor exception. Each syscall page is a table of syscall entries. An entry may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implements an M-on-N threading model, where M is the number of user-space threads and N is the number of syscall threads, with M greater than N [1]. This is similar to a common style of implementing threads, referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC&#039;s exception-less system call mechanism transparently; i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model favours switching among user-space threads: as long as a user-space thread is ready, work continues in user space, thereby minimizing the time spent blocked.&lt;br /&gt;
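The M-on-N idea of staying in user space while exception-less calls are pending can be illustrated with a toy ready-thread scan; the names below are invented for illustration and are not FlexSC Threads&#039; actual scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy sketch of the M-on-N idea: a user-space scheduler keeps running
 * ready threads and only considers blocking once every thread is
 * waiting on a pending exception-less call. */
struct user_thread {
    bool waiting;   /* true while its syscall entry is not yet done */
};

/* Round-robin scan: index of the next runnable thread after `current`,
 * or -1 if all M threads are waiting (only then would the process
 * actually fall into the kernel). */
static int next_ready(const struct user_thread *threads, size_t m,
                      size_t current)
{
    for (size_t step = 1; step <= m; step++) {
        size_t i = (current + step) % m;
        if (!threads[i].waiting)
            return (int)i;
    }
    return -1;
}
```

The design point this illustrates is that blocking becomes a last resort: user-space context switches are cheap relative to a mode switch, so the scheduler exhausts ready work first.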
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this sense, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue building more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for dependent system calls, such as a read followed by a write, where the write call has an outstanding dependency on the read call. This could be resolved with some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently support such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it then tries core X-1, and so on. Of all the algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as in &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16]. Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines, and then recompile the kernel, after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done; the paper mentions that decisions are made based on workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it built at installation time, is it generated initially, or does the installer have to do any manual work? On a related note, scalability with an increased number of cores is ambiguous. It is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be nice to know whether these hardware threads (2 per core) would actually be treated as cores when turned on. That is, would the scheduler then realize that it can use eight cores? And would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address blocking the way exception-less system calls do. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled system call invocation from execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010. [http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;br /&gt;
&lt;br /&gt;
[22] McCracken, Dave. &amp;lt;i&amp;gt;POSIX Threads and the Linux Kernel&amp;lt;/i&amp;gt;, IBM Linux Technology Center, Austin, TX. In Proceedings of the Ottawa Linux Symposium, 2002. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2.8887&amp;amp;rep=rep1&amp;amp;type=pdf#page=330 PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6465</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6465"/>
		<updated>2010-12-02T18:18:53Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* FlexSC Threads */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details. To fully understand the ideas being discussed, it is essential to comprehend the basic vocabulary used in the paper. The most important notions at the core of the FlexSC paper are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the section that follows. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from user mode to kernel mode or from kernel mode back to user mode. The direction does not matter; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time.  The parallel processing of instructions provides a significant speed advantage in processing.  During a mode switch, however, instructions in the  user-mode pipeline are flushed and removed from the processor registers.[1]  These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
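A minimal sketch of batching from the user side, assuming a FlexSC-style shared page of entries; the struct and function names here are assumptions made for illustration, not the paper&#039;s API.

```c
#include <stddef.h>

/* Illustration of system call batching against a shared page. */
enum entry_state { FREE, SUBMITTED, DONE };
struct syscall_entry { long nr; long ret; enum entry_state state; };

/* Queue up to `count` requests into free entries of the page and
 * return how many were queued.  All queued calls can later share a
 * single user/kernel mode switch instead of paying one switch each. */
static size_t queue_batch(struct syscall_entry *page, size_t n,
                          const long *nrs, size_t count)
{
    size_t queued = 0;
    for (size_t i = 0; i < n && queued < count; i++) {
        if (page[i].state == FREE) {
            page[i].nr = nrs[queued++];
            page[i].state = SUBMITTED;
        }
    }
    return queued;
}
```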
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per paper, locality refers to both types of locality, i.e. temporal and spatial, defined above. Thus, lack of locality here means data and instructions needed most frequently by the application continues to be switched back and forth (from registers and caches) due to system calls, attributing hence, to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as the syscall number, the number of arguments, the arguments themselves, a status, and a return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
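One hypothetical way such a 64-byte sysentry could be laid out in C: the paper fixes the size and names the kinds of fields, but this exact arrangement is our assumption.

```c
#include <stdint.h>

/* Hypothetical layout of a FlexSC-style sysentry. */
struct flexsc_sysentry {
    uint64_t syscall_number;   /* which system call to perform */
    uint64_t nargs;            /* how many of args[] are meaningful */
    uint64_t args[5];          /* the arguments themselves */
    uint64_t status_ret;       /* entry status packed with the return value */
};

/* A fixed 64-byte size lets entries pack evenly into cache lines
 * and into the shared syscall pages. */
_Static_assert(sizeof(struct flexsc_sysentry) == 64,
               "sysentry must remain 64 bytes");
```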
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for enabling exception-less system calls. A syscall thread shares the virtual address space of its process [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple independently issued system calls, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first gathering as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the system call will then run once the batch condition is met, and its return value will be stored on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
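The four mechanisms above can be sketched together as a small simulation. This is a toy model, not the paper's implementation: the entry states mirror the Free/Submitted/Done protocol described above, but the class names, the polling loop, and the doubling "handler" are invented purely for illustration.

```python
import threading
import time

# Illustrative states of a syscall entry, mirroring the paper's
# Free / Submitted / Done protocol (the string values are invented).
FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallEntry:
    def __init__(self):
        self.state = FREE
        self.request = None
        self.result = None

class SyscallPage:
    """Toy model of a user/kernel shared page of syscall entries."""
    def __init__(self, n_entries=8):
        self.entries = [SyscallEntry() for _ in range(n_entries)]

    def submit(self, request):
        # User side: claim a free entry and mark it submitted.
        for e in self.entries:
            if e.state == FREE:
                e.request = request
                e.state = SUBMITTED
                return e
        raise RuntimeError("no free syscall entry")

def syscall_thread(page, handler, stop):
    # "Kernel" side: pull submitted entries in batches and execute
    # them, decoupled from the threads that submitted them.
    while not stop.is_set():
        batch = [e for e in page.entries if e.state == SUBMITTED]
        for e in batch:
            e.result = handler(e.request)
            e.state = DONE
        time.sleep(0.001)

page = SyscallPage()
stop = threading.Event()
worker = threading.Thread(target=syscall_thread,
                          args=(page, lambda req: req * 2, stop))
worker.start()

entry = page.submit(21)          # invocation returns immediately
while entry.state != DONE:       # the user thread could do other work here
    time.sleep(0.001)
stop.set()
worker.join()
print(entry.state, entry.result)
```

Note that the submitting thread never traps into the "kernel": it only writes to and polls shared memory, which is the essence of the exception-less interface.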
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implements an M-on-N threading model, where M is the number of user-space threads and N is the number of syscall threads, with M greater than N [1]. This is similar to a common style of implementing threads referred to as the M:N model [22], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, transparently uses FlexSC&#039;s exception-less system call mechanism, i.e. its inner workings differ from the otherwise typical M:N model. The M-on-N model maximizes the amount of user-space thread switching: as long as a user-space thread is ready, work will continue in user space, thus minimizing the time spent blocking.&lt;br /&gt;
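The M-on-N idea above, user-space work continuing while system call requests sit in a shared queue, can be sketched with cooperative tasks. Everything here (the task names, the squaring "kernel" handler, the single service queue) is illustrative and not FlexSC's actual implementation.

```python
from collections import deque

# Toy M-on-N scheduler: M cooperative user tasks multiplexed over one
# "syscall" queue, so user-space work continues while requests are
# pending instead of blocking on each call.
def user_task(name, pending, results):
    # Each task "invokes" one syscall, then waits for its result.
    pending.append((name, 10))
    while name not in results:
        yield                      # give up the CPU instead of blocking
    return results[name]

def run(m_tasks=3):
    pending, results, done = deque(), {}, {}
    tasks = deque((f"t{i}", user_task(f"t{i}", pending, results))
                  for i in range(m_tasks))
    while tasks:
        name, task = tasks.popleft()
        try:
            next(task)             # run the task until it yields
            tasks.append((name, task))
        except StopIteration as fin:
            done[name] = fin.value
        # "Syscall thread": service one pending request per pass.
        if pending:
            req_name, arg = pending.popleft()
            results[req_name] = arg * arg
    return done

print(run())                        # every task completes
```

As long as any task is ready, the scheduler keeps running user-space work; requests drain in the background, which is the behavior the M-on-N model is after.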
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time it has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this light, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is simply not enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation, the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case, it appears that a hybrid solution, synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
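The hybrid policy suggested above could look something like the following sketch; the threshold value and the function names are assumptions for illustration, not taken from the paper.

```python
# Hypothetical hybrid dispatch policy: below some threshold of pending
# calls, issue them synchronously; at or above it, batch them so the
# whole group shares a single kernel crossing. The threshold is made up.
BATCH_THRESHOLD = 4

def dispatch(calls):
    if len(calls) >= BATCH_THRESHOLD:
        return ("batched", 1)              # one mode switch for the batch
    return ("synchronous", len(calls))     # one mode switch per call

print(dispatch(["read"]))                  # a lone call stays synchronous
print(dispatch(["read"] * 8))              # a big batch pays one switch
```

The second element of the returned tuple counts mode switches, which is the quantity such a policy would try to minimize.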
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for system calls such as reads and writes, where a write call has an outstanding dependency on a read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently support such a mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the scheduling algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16]. Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it did not remove the requirement of data copying and did not reduce the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
This means that adopters would have to add or modify the referenced lines and recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: at installation time, generated initially, or does the installer have to do manual work? On a related note, scalability with an increasing number of cores is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. This is understandable; however, it would be useful to know whether these hardware threads (2 per core) would actually be treated as cores when turned on, i.e. would the scheduler then realize that it can use eight cores? Does that also mean the predefined static core list would need to be modified to list eight cores instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a mechanism which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler supports a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
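A looped multi-call, where the result of one call feeds the next while crossing into the kernel only once, can be illustrated roughly as follows. The call chain here (fake_open, fake_read) and its return values are entirely made up for the example.

```python
# Illustrative model of a "looped multi-call": the results of one
# system call become the arguments of the next, and the whole chain is
# submitted as a single batch (a single kernel crossing).
def multi_call(calls, initial_arg):
    # Executes the chained calls sequentially inside one "crossing".
    arg = initial_arg
    for fn in calls:
        arg = fn(arg)
    return arg

fake_open = lambda path: 3        # pretend file descriptor
fake_read = lambda fd: fd + 39    # pretend payload derived from the fd

result = multi_call([fake_open, fake_read], "example-path")
print(result)                     # 42
```

Contrast this with exception-less system calls, where each entry on a syscall page is independent and no such chaining of results is provided.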
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which try to tackle the issue of locality of execution by handling device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is the most similar to the multicore execution of FlexSC. It proposes processor modifications that allow hardware migration of threads to specialized cores; however, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and they cannot, as FlexSC can, dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A., &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J., and Parkes, M., &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M., and Druschel, P., &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P., and Banga, G., &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., Wells, P. M., and Sohi, G. S., &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ ]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6464</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6464"/>
		<updated>2010-12-02T18:18:15Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* FlexSC Threads */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details. It is essential to comprehend the basic vocabulary used in the paper in order to fully understand the ideas being discussed. The most important notions in the FlexSC paper, at the core of it all, are system calls[21] and synchronous systems. These base definitions, along with numerous other helpful ideas, are explained in the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, moving from user-space mode to kernel mode, or from kernel mode to user space. It does not matter which direction or which modes we are switching between; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Pipeline Flushing===&lt;br /&gt;
The regular operation of a CPU has multiple instructions being fetched, decoded and executed at the same time.  The parallel processing of instructions provides a significant speed advantage in processing.  During a mode switch, however, instructions in the  user-mode pipeline are flushed and removed from the processor registers.[1]  These lost instructions are part of the cost of a mode switch.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, a lack of locality here means that the data and instructions needed most frequently by the application keep being evicted (from registers and caches) due to system calls, thereby contributing to performance degradation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction which usually takes two arguments: a value, and a memory location where that value should be stored.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as syscall number, number of arguments, the arguments, status, and return value [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
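A possible byte layout for such a 64-byte sysentry can be sketched with Python's struct module. The paper names the fields but not their widths or order, so the packing below is an assumption chosen only so that the entry totals 64 bytes.

```python
import struct

# Toy layout for the 64-byte sysentry described above: four 4-byte
# integer fields followed by six 8-byte argument slots. The field
# widths and ordering are assumptions, not the paper's actual layout.
SYSENTRY = struct.Struct("iiii6q")   # 4x4 bytes + 6x8 bytes = 64 bytes

entry = SYSENTRY.pack(
    1,          # syscall number (1 is write on x86-64 Linux)
    3,          # number of arguments
    0,          # status (0 = free, 1 = submitted, 2 = done)
    0,          # return-value slot in this toy layout
    1, 0, 64, 0, 0, 0,   # up to six 64-bit arguments
)
print(len(entry))    # 64
```

Because entries are fixed-size, a syscall page holds a whole number of them, and both sides can index the shared page without any coordination beyond the status field.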
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of its process [1].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily, and it is generally accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching is very similar to FlexSC in that multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel versus the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation, which is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple independently issued system calls, state pollution of key processor structures (TLB, caches, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is issued, FlexSC groups system calls into batches. Each batch can then be executed at one time, minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode and executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, the system call execution is decoupled from the system call invocation, as described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;: a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to request kernel-mode services. A user-mode thread writes a system call request into a free entry of a syscall page; the request is executed once the batch condition is met, and the return value is stored back in the same entry, where the user-mode thread can later retrieve it. Neither issuing the system call via the syscall page nor fetching the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and each entry is in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel has finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
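To make the flow of these mechanisms concrete, here is a rough user-level simulation. All names are our own and purely illustrative; FlexSC's real syscall pages and syscall threads live inside the kernel. A plain Python worker thread stands in for a kernel-mode syscall thread (possibly on a dedicated core), and draining the page in one pass stands in for executing a batch:

```python
# Toy model of the mechanisms above: a shared "syscall page" whose entries
# cycle through free, submitted, and done states, drained in batches by a
# separate "syscall thread" (a plain worker thread standing in for
# FlexSC's kernel-mode thread). Hypothetical sketch, not FlexSC's API.
import threading

FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallPage:
    def __init__(self, n_entries=8):
        self.lock = threading.Lock()
        self.entries = [{"state": FREE, "call": None, "ret": None}
                        for _ in range(n_entries)]

    def submit(self, call):
        # User side: claim a free entry; no processor exception is raised.
        with self.lock:
            for i, e in enumerate(self.entries):
                if e["state"] == FREE:
                    e["call"], e["state"] = call, SUBMITTED
                    return i
        return None                      # page full; caller must wait

    def drain(self):
        # "Kernel" side: execute every submitted entry in one batch.
        with self.lock:
            for e in self.entries:
                if e["state"] == SUBMITTED:
                    e["ret"], e["state"] = e["call"](), DONE

    def collect(self, i):
        # User side: pick up the result and free the entry, exception-free.
        with self.lock:
            e = self.entries[i]
            if e["state"] == DONE:
                ret, e["state"] = e["ret"], FREE
                return ret
        return None                      # not ready yet

page = SyscallPage()
slots = [page.submit(lambda v=v: v * 10) for v in (1, 2, 3)]
worker = threading.Thread(target=page.drain)   # decoupled execution
worker.start()
worker.join()
assert [page.collect(i) for i in slots] == [10, 20, 30]
```

The point of the sketch is that invocation (submit) and execution (drain) touch only shared memory, so the user side never takes a synchronous exception and can keep running between the two.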
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC Threads are a key component of the exception-less system call interface. FlexSC Threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC Threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FlexSC Threads implements an M-on-N threading model, where M is the number of user-level threads and N is the number of syscall threads, with M greater than N.[1] This is similar to the common M:N threading model[21], where M represents user-space threads and N represents kernel threads. The M-on-N model, however, uses FlexSC&#039;s exception-less system call mechanism transparently, so its inner workings differ from the otherwise typical M:N model. The M-on-N model maximizes user-space thread switching: as long as some user-space thread is ready, work continues in user space, thus minimizing the time spent blocked.&lt;br /&gt;
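A toy sketch may help illustrate the M-on-N scheduling idea, with generators standing in for user threads. This is purely illustrative of the "keep running whatever is ready" behaviour, not FlexSC's implementation:

```python
# Toy sketch of the M-on-N idea: M user "threads" (generators, a
# hypothetical stand-in) are multiplexed in user space. When one issues a
# simulated system call it yields, and the scheduler immediately runs
# another ready thread instead of blocking. Illustrative only.

def user_thread(name, n_calls, log):
    # A user-space thread that issues n_calls system call requests.
    for i in range(n_calls):
        log.append((name, "issue", i))
        yield ("syscall", name, i)    # would land in a syscall page entry
    log.append((name, "done", None))

def run_m_on_n(threads):
    # Round-robin over ready user threads, harvesting syscall requests;
    # as long as any thread is ready, user-space work continues.
    pending, ready = [], list(threads)
    while ready:
        still_ready = []
        for t in ready:
            try:
                pending.append(next(t))
            except StopIteration:
                continue              # this user thread has finished
            still_ready.append(t)
        ready = still_ready
    return pending

log = []
pending = run_m_on_n([user_thread("A", 2, log), user_thread("B", 1, log)])
assert len(pending) == 3              # A issued 2 requests, B issued 1
```

The harvested `pending` list corresponds to the batch of submitted syscall page entries that syscall threads would then execute.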
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles roughly every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. The paper claims that this gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, caches, and memory.[1] In this light, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the paper focuses on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers exhibit a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are inherently asynchronous, if one needs to block, FlexSC jumps to the next system call and executes that one instead. This can cause a problem for dependent system calls, such as a write call with an outstanding dependency on a preceding read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call; unfortunately, FlexSC does not currently support such a mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a syscall thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
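As we read the policy described above, a minimal sketch of the core selection might look like the following (function name and shape are our own, purely illustrative):

```python
# Sketch of the simple policy described above: to run a batch of system
# calls, try the highest-numbered core first and walk downward until a
# non-busy core is found. Hypothetical code, not FlexSC's implementation.

def pick_syscall_core(busy, n_cores):
    # busy: set of core ids currently occupied by syscall threads
    for core in range(n_cores - 1, -1, -1):   # core X, then X-1, ...
        if core not in busy:
            return core
    return None   # every core is busy; the batch must wait

assert pick_syscall_core(set(), 4) == 3       # core X is free
assert pick_syscall_core({3}, 4) == 2         # X busy, fall back to X-1
assert pick_syscall_core({0, 1, 2, 3}, 4) is None
```

Its appeal is statelessness: no load statistics are tracked, which is presumably part of why the simplest algorithm tested best for a single application.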
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan, whose research aims to reduce the energy footprint of data centers.[16] FlexSC was considered, and it was found that FlexSC&#039;s reduction of mode switches, via memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removes the requirement of data copying nor reduces the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. no application code modification is needed, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper.[1]&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to re-apply the referenced lines and recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler attempts to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how this dynamic allocation is done; it is mentioned that decisions are based on the workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignment. It is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do manual work? On a related note, scalability with increasing core counts is ambiguous. One gets the impression that the scheduler is very scalable, because each core spawns a syscall thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on, i.e. would the scheduler then realize that it can use eight cores? Does that also mean the predefined static core list would need to be modified, to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept in which multiple system calls are collected and submitted as a single system call; they are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls neither investigate parallel execution of system calls nor address blocking system calls the way exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution by handling device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, TLBs were not modeled, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike FlexSC, they cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behaviour. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between these proposals and FlexSC is that none of the non-blocking approaches decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=6233</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=6233"/>
		<updated>2010-12-02T06:38:55Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is working on what ? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group 3 Essay=&lt;br /&gt;
&lt;br /&gt;
Hello everyone, please post your contact information here:&lt;br /&gt;
&lt;br /&gt;
Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Rey Arteaga: rarteaga@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
Corey Faibish: [mailto:corey.faibish@gmail.com corey.faibish@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Fangchen Sun: [mailto:sfangche@connect.carleton.ca sfangche@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Mike Preston: [mailto:michaelapreston@gmail.com michaelapreston@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Wesley L. Lawrence: [mailto:wlawrenc@connect.carleton.ca wlawrenc@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
Can&#039;t access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it&#039;s not too bad. Rarteaga&lt;br /&gt;
&lt;br /&gt;
==Question 3 Group==&lt;br /&gt;
*Abdul-Fatah Tawfic tafatah&lt;br /&gt;
*Arteaga Reynaldo rarteaga&lt;br /&gt;
*Faibish Corey   cfaibish&lt;br /&gt;
*Lawrence Wesley wlawrenc&lt;br /&gt;
*Preston Mike    mpreston&lt;br /&gt;
*Robson  Benjamin brobson&lt;br /&gt;
*Sun     Fangchen sfangche&lt;br /&gt;
&lt;br /&gt;
==Who is working on what ?==&lt;br /&gt;
Just to keep track of who&#039;s doing what --[[User:Tafatah|Tafatah]] 01:37, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey everyone, I have taken the liberty of trying to provide a good first start on our paper. I have provided many resources and filled in information for all of the sections. This is not complete, but it should make the rest of the work a lot easier. Please go through and add in pieces that I am missing (specifically in the Critique section) and then we can put this essay to bed. Also, please note that below I have included my notes on the paper so that if anyone feels they do not have time to read the paper, they can read my notes instead and still find additional materials to contribute with.&lt;br /&gt;
--[[User:Mike Preston|Mike Preston]] 18:22, 20 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Man, Mike: you did a nice job! I&#039;m reading through it now very thorough :) Since you pretty much turned all of your bulleted points from the discussion page into that on the main page, what else needs to be done? Just expanding on each topic and sub-topic? Or are there untouched concepts/topics that we should be addressing?&lt;br /&gt;
Oh and question two: Should we turn the Q&amp;amp;A from the end of the video of the presentation into information for the &#039;&#039;Critique&#039;&#039; section?&lt;br /&gt;
--[[User:CFaibish|CFaibish]] 20:34, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Mike, thnx for the great job! I basically finished the part of related work based on your draft.&lt;br /&gt;
--[[User:sfangchen|Fangchen Sun]] 17:40, 24 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
No problem, And great additions. &lt;br /&gt;
In terms of what needs to be done, I do believe that adding some detail to the critique is where we really need some focus. Using the Q&amp;amp;A from the video is probably a great source of inspiration; maybe just take a look at the topics presented, see if additional material from other sources can be obtained, and use those sources to address any pros or cons of this article. Remember, the critique section can be agreeing or disagreeing with what is presented in the actual paper.&lt;br /&gt;
--[[User:Mike Preston|Mike Preston]] 15:12, 28 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I noticed we needed some work in the Critique section, so I listened to the Q&amp;amp;A session at the end of the FlexSC mp3 talk, and took some quick notes. There seem to be 3 good ones (of the 9) that I picked out. I&#039;ll summarize them and add to the Critique section, specifically questions 3, 6, and 7. If anyone else wants to have a listen to a specific question, and maybe try to do some more &#039;critiquing&#039;, here is a list of what time each question takes place, a very general statement on what the question is about, and the very general answer:&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;1 - 22:30 &amp;lt;br&amp;gt;Q: Did the paper consider Upstream patches(?) &amp;lt;br&amp;gt;A:No&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;2 - 23:00 &amp;lt;br&amp;gt;Q: Security issues with the pages &amp;lt;br&amp;gt;A:Pages pre-processor, no issue&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;3 - 24:10 &amp;lt;br&amp;gt;Q: What about blocking calls (read/write)? &amp;lt;br&amp;gt;A: Not handled&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;4 - 25:50 &amp;lt;br&amp;gt;Q: ? &amp;lt;br&amp;gt;A: Not a problem? (Personally didn&#039;t understand the question, don&#039;t believe it&#039;s important, but anyone who&#039;s willing should double check)&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;5 - 28:00 &amp;lt;br&amp;gt;Q: Compare pollution between user thread switching to user-kernel thread switching? &amp;lt;br&amp;gt;A: No, only looked at and measured pollution when switching user-to-kernel.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;6 - 29:30 &amp;lt;br&amp;gt;Q: Scheduling problems of what cores are &#039;system&#039; core, and what cores are &#039;user&#039; cores &amp;lt;br&amp;gt;A: Very simple algorithm, but not tested when running multiple apps, would need to be &amp;quot;fine-tuned&amp;quot;&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;7 - 31:00 &amp;lt;br&amp;gt;Q: Situations where FlexSC is bad, when running less or equal threads to the number of cores, such as &amp;quot;Scientific programs&amp;quot;, mostly in userspace where one thread has 100% CPU resource &amp;lt;br&amp;gt;A: Agrees, FlexSC is not to be used for such situations&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;8 - 33:00 &amp;lt;br&amp;gt;Q: Problems with un-related threads demanding service, how does it scale? Issue with frequency of polling could cause sys calls to take time to preform &amp;lt;br&amp;gt;A: (Would be answered offline)&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;9 - 34:30 &amp;lt;br&amp;gt;Q: Backwards compatibility and robustness &amp;lt;br&amp;gt;A: Only an issue with getTID (Thread ID), needed a small patch.&lt;br /&gt;
&lt;br /&gt;
--[[User:Wlawrence|Wesley Lawrence]] 20:31, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Wrote information in Critique for questions 3, 6 and 7 (Blocking Calls, Core Scheduling Issues, and When There Are Not More Threads Than Cores). If you feel any additions need to be made, please feel free to add them. Most importantly, I&#039;m not sure how to cite these. All information was obtained from the mp3 of the presentation; could someone let me know how to go about citing this?&lt;br /&gt;
&lt;br /&gt;
--[[User:Wlawrence|Wesley Lawrence]] 21:05, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I&#039;m going to run through the whole paper, and just make sure everything makes sense, and fill in the holes where needed. I&#039;ll also add my own thoughts along the way. Feel free to do the same.-Rarteaga&lt;br /&gt;
&lt;br /&gt;
Added 3 sections to the critique, definitions for the remaining terms (thanks Corey for taking care of some of these) and did some editing. My plan is to add some more flesh to the FlexSC-Threads section.&lt;br /&gt;
I&#039;ll do that sometime before 3PM on Thursday. I&#039;ll also go over the paper at that time in case something needs some editing. --[[User:Tafatah|Tafatah]] 06:38, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
==Paper Summary==&lt;br /&gt;
I am not sure if everyone has taken the time to examine the paper closely, so I thought I would provide my notes on the paper so that anyone who has not read it could have a view of the high points.&lt;br /&gt;
&lt;br /&gt;
Abstract:&lt;br /&gt;
   - System calls are the accepted way to request services from the OS kernel, historical implementation.&lt;br /&gt;
   - System calls almost always synchronous &lt;br /&gt;
   - Aim to demonstrate how synchronous system calls negatively affect performance due mainly to pipeline flushing and pollution of key processor structures (TLB, data and instruction caches, etc.)&lt;br /&gt;
        o TLB is the translation lookaside buffer, which caches translations of (data and code) pages to speed up virtual address translation.&lt;br /&gt;
   - Propose exception-less system calls to improve the current system call process.&lt;br /&gt;
        o Improve processor efficiency by enabling flexible scheduling of OS work, which reduces the footprint of execution in both kernel and user space, thus reducing pollution effects on processor structures.&lt;br /&gt;
   - Exception-less system calls especially effective on multi-core systems running multi-threaded applications.&lt;br /&gt;
   - FlexSC is an implementation of exception-less system calls in the Linux kernel with accompanying user-mode threads from FlexSC-Threads package.&lt;br /&gt;
        o Flex-SC-Threads convert legacy system calls into exception-less system calls.&lt;br /&gt;
Introduction:&lt;br /&gt;
   - Synchronous system calls have a negative impact on system performance due to:&lt;br /&gt;
        o Direct costs – mode switching&lt;br /&gt;
        o Indirect costs – pollution of important processor structures &lt;br /&gt;
   - Traditional system calls:&lt;br /&gt;
        o Involve writing arguments to appropriate registers as well as issuing a special machine instruction which raises a synchronous exception.&lt;br /&gt;
        o A processor exception is used to communicate with the kernel.&lt;br /&gt;
        o Synchronous execution is enforced as the application expects the completion of the system call before user-mode execution resumes.&lt;br /&gt;
   - Moore’s Law has provided large increases to performance potential of software while at the same time widening the gap between the performance of efficient and inefficient software.&lt;br /&gt;
        o This gap is mainly caused by disparity of accessing different processor resources (registers, caches, memory)&lt;br /&gt;
   - Server and system-intensive workloads are known to perform well below processor potential throughput.&lt;br /&gt;
        o These are the items the researchers are mostly interested in.&lt;br /&gt;
        o The cause is often described as due to the lack of locality.&lt;br /&gt;
        o The researchers state this lack of locality is in part a result of the current synchronous system calls.&lt;br /&gt;
   - When a synchronous system call, like pwrite, is issued, the instructions-per-cycle (IPC) rate drops significantly, and it takes many cycles of execution (14,000 in the paper's example) for the IPC rate to return to its pre-call level.&lt;br /&gt;
   - Exception-less System Call:&lt;br /&gt;
        o Request for kernel services that does not require the use of synchronous processor exceptions.&lt;br /&gt;
        o System calls are written to a reserved syscall page.&lt;br /&gt;
        o Execution of system calls is performed asynchronously by special kernel level syscall threads. The result of the execution is stored on the syscall page after execution.&lt;br /&gt;
   - By separating system call execution from system call invocation, the system can now have flexible system call scheduling.&lt;br /&gt;
        o This allows system calls to be executed in batches, increasing the temporal locality of execution.&lt;br /&gt;
        o Also provides a way to execute system calls on a separate core, in parallel to user-mode thread execution. This provides spatial per-core locality.&lt;br /&gt;
        o An additional side effect is that now a multi-core system can have individual cores designated to run either user-mode or kernel mode execution dynamically depending on the current system load.&lt;br /&gt;
   - In order to implement the exception-less system calls, the research team suggests adding a new M-on-N threading package.&lt;br /&gt;
        o M user-mode threads executing on N kernel-visible threads.&lt;br /&gt;
        o This would allow the threading package to harvest independent system calls, by switching threads, in user-mode, whenever a thread invokes a system call.&lt;br /&gt;
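The traditional synchronous invocation described above can be sketched with Linux's raw syscall(2) wrapper; `sync_write` is an illustrative name, not anything from the paper:

```c
/* Minimal sketch of a traditional synchronous system call on Linux: the
 * syscall instruction raises a processor exception, the pipeline is flushed,
 * and the calling thread blocks until the kernel returns. */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

long sync_write(int fd, const char *msg) {
    /* Execution does not resume here until the kernel has completed the
     * call: invocation and execution are coupled. */
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Every call like this pays both the direct mode-switch cost and the indirect pollution cost described below.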
The (Real) Cost of System Calls&lt;br /&gt;
   - The traditional way to measure the performance cost of system calls is the mode switch time: the time necessary to execute the system call instruction in user-mode, resume execution in kernel mode, and then return execution back to user-mode.&lt;br /&gt;
   - Mode switch in modern processors is a processor exception.&lt;br /&gt;
        o Flush the user-mode pipeline, save registers onto the kernel stack, change the protection domain and redirect execution to the proper exception handler.&lt;br /&gt;
   - Another measure of the performance of a system call is the state pollution caused by the system call.&lt;br /&gt;
         o State pollution is the measure of how much user-mode data is overwritten in structures like the TLB, caches (L1, L2, L3), and branch prediction tables by kernel-level execution of the system call.&lt;br /&gt;
        o This data must be re-populated upon the return to user-mode.&lt;br /&gt;
   - Potentially the most significant measure of cost of system calls is the performance impact on a running application.&lt;br /&gt;
        o Ideally, user-mode instructions per cycle should not decrease as a result of a system call.&lt;br /&gt;
         o Synchronous system calls do cause a drop in user-mode IPC due to: direct overhead (the processor exception associated with the system call, which flushes the processor pipeline) and indirect overhead (system call pollution of processor structures).&lt;br /&gt;
Exception-less System calls:&lt;br /&gt;
   - System call batching&lt;br /&gt;
        o By delaying a series of system calls and executing them in batches you can minimize the frequency of mode switches between user and kernel mode.&lt;br /&gt;
        o Improves both the direct and indirect cost of system calls.&lt;br /&gt;
   - Core specialization&lt;br /&gt;
         o An exception-less system call can be scheduled on a different core than the core on which it was invoked.&lt;br /&gt;
        o Provides ability to designate a core to run all system calls.&lt;br /&gt;
   - Exception-less Syscall Interface&lt;br /&gt;
        o Set of memory pages shared between user and kernel modes. Referred to as Syscall pages.&lt;br /&gt;
         o User-space threads find a free entry in a syscall page and place a request for a system call. The user-space thread can then continue executing without interruption, and must later return to the syscall page to get the return value from the system call.&lt;br /&gt;
        o Neither issuing the system call (via the syscall page) nor getting the return value generate an exception.&lt;br /&gt;
   - Syscall pages&lt;br /&gt;
        o Each page is a table of syscall entries.&lt;br /&gt;
         o Each syscall entry has a state:&lt;br /&gt;
                  Free – means a syscall can be added here.&lt;br /&gt;
                 Submitted – means the kernel can proceed to invoke the appropriate system call operations.&lt;br /&gt;
                 Done – means the kernel is finished and has provided the return value to the syscall entry. User space thread must return and get the return value from the page.&lt;br /&gt;
   - Decoupling Execution from Invocation&lt;br /&gt;
        o To separate these two concepts a special kernel thread, syscall thread, is used.&lt;br /&gt;
        o Sole purpose is to pull requests from syscall pages and execute them always in kernel mode.&lt;br /&gt;
        o Syscall threads provide the ability to schedule the system calls on specific cores.&lt;br /&gt;
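The entry lifecycle implied by these notes (Free, Submitted, Done, then Free again) can be captured as a tiny state machine; the enum and function names are illustrative, not FlexSC's actual identifiers:

```c
/* Sketch of one syscall entry's lifecycle, per the notes above:
 *   Free -> Submitted  (user posts a request, no exception raised)
 *   Submitted -> Done  (a kernel syscall thread executes it, stores result)
 *   Done -> Free       (user reads the return value, releasing the slot) */
enum entry_state { ENTRY_FREE, ENTRY_SUBMITTED, ENTRY_DONE };

/* Advance an entry to the next legal state in the cycle. */
enum entry_state next_state(enum entry_state s) {
    switch (s) {
    case ENTRY_FREE:      return ENTRY_SUBMITTED; /* user posts a request   */
    case ENTRY_SUBMITTED: return ENTRY_DONE;      /* syscall thread runs it */
    default:              return ENTRY_FREE;      /* user harvests result   */
    }
}
```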
System Calls Galore – FlexSC-Threads&lt;br /&gt;
   - Programming for exception-less system calls requires a different and more complex way of interacting with the kernel for OS functionality.&lt;br /&gt;
         o The researchers describe working with exception-less system calls as being similar to event-driven programming, in that you do not get the same sequential execution of code as you do with synchronous system calls.&lt;br /&gt;
        o In event-driven servers, the researchers suggest using a hybrid of both exception-less system calls (for performance critical paths) and regular synchronous system calls (for less critical system calls).&lt;br /&gt;
FlexSC-Threads&lt;br /&gt;
   - Threading package which transforms synchronous system calls into exception-less system calls.&lt;br /&gt;
   - Intended use is with server-type applications which have many user-mode threads (like Apache or MySQL).&lt;br /&gt;
   - Compatible with both POSIX threads and the default Linux thread library.&lt;br /&gt;
        o As a result, multi-threaded Linux programs are immediately compatible with FlexSC threads without modification.&lt;br /&gt;
   - For multi-core systems, a single kernel level thread is created for each core of the system. Multiple user-mode threads are multiplexed onto each kernel level thread via interactions with the syscall pages.&lt;br /&gt;
         o The syscall pages are private to each kernel-level thread, meaning each core of a system has a syscall page from which it receives system calls.&lt;br /&gt;
Overhead:&lt;br /&gt;
   - When running a single exception-less system call against a single synchronous system call, the exception-less call was slower.&lt;br /&gt;
   - When running a batch of exception-less system calls against an equal number of synchronous system calls, the exception-less system calls were much faster.&lt;br /&gt;
   - The same is true in a remote server situation: one synchronous call is much faster than one exception-less system call, but a batch of exception-less system calls is faster than the same number of synchronous system calls.&lt;br /&gt;
Related Work:&lt;br /&gt;
   - System Call Batching&lt;br /&gt;
        o Operating systems have a concept called multi-calls which involves collecting multiple system calls and submitting them as a single system call.&lt;br /&gt;
        o The Cassyopia compiler has an additional process called a looped multi-call where the result of one system call can be fed as an argument to another system call in the same multi-call.&lt;br /&gt;
        o Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls like exception-less system calls do.&lt;br /&gt;
                 Multi-call system calls are executed sequentially, each one must complete before the next may start.&lt;br /&gt;
   - Locality of Execution and Multicores&lt;br /&gt;
         o Other techniques include Soft Timers and Lazy Receiver Processing, which try to tackle the issue of locality of execution by handling device interrupts; both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests.&lt;br /&gt;
        o Computation Spreading is another locality process which is similar to FlexSC.&lt;br /&gt;
                 Processor modifications that allow hardware migration of threads and migration to specialized cores.&lt;br /&gt;
                  Their evaluation did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt.&lt;br /&gt;
        o Also have proposals for dedicating CPU cores to specific operating system functionality.&lt;br /&gt;
                 These solutions require a microkernel system.&lt;br /&gt;
                  Also, FlexSC can dynamically adapt the proportion of cores used by the kernel versus cores shared by user and kernel execution.&lt;br /&gt;
   - Non-blocking Execution&lt;br /&gt;
        o Past research on improving system call performance has focused on blocking versus non-blocking behaviour.&lt;br /&gt;
                 Typically researchers used threading, event-based (non-blocking) and hybrid systems to obtain high performance on server applications.&lt;br /&gt;
        o Main difference between past research and FlexSC is that none of the past proposals have decoupled system call execution from system call invocation.&lt;br /&gt;
--[[User:Mike Preston|Mike Preston]] 04:03, 20 November 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6232</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6232"/>
		<updated>2010-12-02T06:28:09Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Producer-Consumer Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details on the specifics of the essay. It is essential to comprehend the basic vocabulary used in the paper in order to fully understand the ideas being discussed. The most important notions in the FlexSC paper, the ones at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the section that follows. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user-mode, perform the kernel-mode execution of the system call, and finally return execution back to user-mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality here means the data and instructions needed most frequently by the application are continually evicted (from registers and caches) by system calls, contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction which usually takes two arguments: a value, and a memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry is a 64-byte data structure which includes information such as the syscall number, number of arguments, the arguments themselves, status, and return value [1].&lt;br /&gt;
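One hypothetical way to pack those fields into 64 bytes on a 64-bit ABI; the paper names the contents but not their order or widths, so this layout is an assumption:

```c
/* Hypothetical packing of one 64-byte syscall entry. The paper specifies the
 * contents (syscall number, number of arguments, the arguments, status,
 * return value) but not the exact field widths; these are assumptions chosen
 * so that one entry fills a 64-byte cache line on a 64-bit ABI. */
#include <stdint.h>

struct syscall_entry {
    uint32_t syscall_nr; /* which system call to execute             */
    uint16_t nargs;      /* how many slots of args[] are valid       */
    uint16_t status;     /* free / submitted / done                  */
    uint64_t args[6];    /* Linux system calls take up to 6 arguments */
    uint64_t ret;        /* return value, written when status = done */
};

_Static_assert(sizeof(struct syscall_entry) == 64,
               "one entry per 64-byte cache line");
```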
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of the process on whose behalf it runs [1].&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel system, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, the state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and in terms of the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may place a system call request in a free entry of a syscall page; the call will then be executed once the batch condition is met, and its return value stored back in the syscall entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
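A minimal user-space simulation of the decoupling just described, with a doubling function standing in for real kernel work; the table, function names, and workload are all illustrative, not FlexSC's API:

```c
/* User code posts requests to a shared table without trapping; a separate
 * "syscall thread" routine drains submitted entries in a batch and writes
 * results back, mimicking decoupled invocation and execution. */
#define NENTRIES 4
enum { S_FREE, S_SUBMITTED, S_DONE };

struct entry { int status; long arg; long ret; };
static struct entry page[NENTRIES]; /* stand-in for a shared syscall page */

/* User side: find a free slot, post a request, keep running. */
int submit(long arg) {
    for (int i = 0; i < NENTRIES; i++)
        if (page[i].status == S_FREE) {
            page[i].arg = arg;
            page[i].status = S_SUBMITTED;
            return i;
        }
    return -1; /* page full */
}

/* "Syscall thread" side: execute every submitted entry in one batch. */
void run_batch(void) {
    for (int i = 0; i < NENTRIES; i++)
        if (page[i].status == S_SUBMITTED) {
            page[i].ret = page[i].arg * 2; /* stand-in for kernel work */
            page[i].status = S_DONE;
        }
}

/* User side again: harvest a completed result, freeing the slot. */
long harvest(int i) {
    long r = page[i].ret;
    page[i].status = S_FREE;
    return r;
}
```

In FlexSC the run_batch role belongs to in-kernel syscall threads, which is what lets the batch execute on a different core from the submitting thread.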
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one. This can cause a problem for system calls such as a read followed by a write, where the write call has an outstanding dependency on the read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a syscall thread; when another batch comes in, if core X is still busy, it tries core X-1, and so on. Of all the algorithms they tested, this, the simplest, turned out to be the most efficient for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
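The core-selection rule just described (try the highest core, then the next one down, and so on) can be sketched as follows; `busy[]` is an assumed stand-in for real per-core load information:

```c
/* Walk from the highest-numbered core downward and pick the first idle one,
 * mirroring the "core X, then X-1" rule described above. Returns the core
 * index, or -1 if every core is busy. */
int pick_core(const int *busy, int ncores) {
    for (int c = ncores - 1; c >= 0; c--)
        if (!busy[c])
            return c;
    return -1; /* all cores busy */
}
```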
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not well suited to data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16]. Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered for that purpose. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the need for data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done: the paper mentions that decisions are made based on workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments, but it is unclear when that list is created. Is it built at installation time, is it generated initially, or does the installer have to do manual work? On a related note, scalability with increasing core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether those hardware threads (2 per core) would be treated as cores when turned on. That is, would the scheduler then realize it can use eight cores, and would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same line of reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize about the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
A multi-call is a technique that collects multiple system calls and submits them as a single system call. It is used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special form named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. The calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. The technique most similar to the multicore execution of FlexSC is Computation Spreading[15], which proposes processor modifications that allow hardware migration of threads to specialized cores. However, its authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Another solution differs from FlexSC in two ways: it requires a micro-kernel, and unlike it, FlexSC can dynamically adapt the proportion of cores used by the kernel, or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of them decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tal, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A., &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J., and Parkes, M., &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M., and Druschel, P., &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P., and Banga, G., &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., Wells, P. M., and Sohi, G. S., &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Teller, Patricia J., &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Computer, Volume 23, Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] Drepper, U., and Molnar, I., &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;, Tech. rep., Red Hat Inc., 2003. [http://people.redhat.com/drepper/nptl-design.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[20] Blake, M. Brian, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;, Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] IBM developerWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010. [http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6231</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6231"/>
		<updated>2010-12-02T06:27:53Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Instructions Per Cycle (IPC) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details. It is essential to comprehend the basic vocabulary used in the paper in order to fully understand the ideas being discussed. The most important notions in the FlexSC paper, at the core of it all, are system calls[21] and synchronous execution. These base definitions, along with numerous other helpful ideas, are explained in the sections that follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more important for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence, system calls are the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
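As a minimal illustration of this gateway, the C function below asks the kernel to perform a write on the caller&#039;s behalf via the standard write(2) system call; the helper function name is purely illustrative.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* A system call is the sanctioned path from user space into the
 * kernel's services: write(2) below traps into the kernel, which
 * copies the bytes to the file referred to by fd and returns the
 * number of bytes written (or -1 on error). */
ssize_t say(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}
```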
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode switches&amp;lt;/b&amp;gt; refer to moving from one mode to another: specifically, from user mode to kernel mode, or from kernel mode back to user mode. It does not matter which direction or which modes we are switching between; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (system call interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process that initiates the system call is blocked until the call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call that does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data overwritten in critical processor structures such as the TLB (translation look-aside buffer, a table that reduces the frequency of main-memory accesses for page table entries), the branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations that cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, lack of locality here means that the data and instructions needed most frequently by the application keep being evicted from registers and caches due to system calls, thereby contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and a memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry (sysentry) is a 64-byte data structure, which includes information such as the syscall number, the number of arguments, the arguments themselves, a status field, and the return value [1].&lt;br /&gt;
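A sketch of such an entry in C might look as follows. The field names, types, and layout are illustrative guesses: the paper specifies only the contents and the 64-byte size, and since real x86-64 system calls take up to six arguments, the actual packing must differ from this simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-byte syscall entry, as described above. Field names
 * and layout are invented; only the listed contents and the 64-byte
 * size come from the paper. Four argument slots are used here purely
 * to keep the structure at 64 bytes. */
struct sysentry {
    uint64_t syscall_nr;  /* which system call to run         */
    uint64_t status;      /* FREE, SUBMITTED, or DONE         */
    uint64_t nargs;       /* how many of args[] are valid     */
    uint64_t ret;         /* return value once status is DONE */
    uint64_t args[4];     /* argument slots (simplified)      */
};
/* 8 fields x 8 bytes = 64 bytes, one cache line on most x86 parts */
```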
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of the process on whose behalf it runs [1].&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem is a classic synchronization scenario in which one party places items into a shared buffer while another removes and processes them. FlexSC&#039;s syscall pages form exactly such an arrangement: user-mode threads produce system call requests as entries on the shared page, and kernel-mode syscall threads consume and execute them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use comes with undesirable side effects that can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call that minimizes the negative effects of synchronous system calls while remaining easy for application programmers to adopt.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], on &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and on &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between kernel and user as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking), and hybrid solutions; FlexSC, by contrast, provides a mechanism to separate system call execution from system call invocation, which is the key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode-switch time of multiple system calls each invoked independently, the state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the request is then executed once the batch condition is met, and its return value is stored back in the entry. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and each entry may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel has finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call&#039;s invocation from its execution, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows a user-mode thread to issue a request and continue running while the kernel-level system call is being executed. In addition, since invocation is separate from execution, a process running on one core may request a system call whose execution is completed on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while the other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
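The Free/Submitted/Done protocol described in the list above can be sketched in C. This is a toy reconstruction from the paper&#039;s description, not FlexSC&#039;s actual code: the names are invented, and the kernel-side step is simulated by an ordinary function rather than a real syscall thread operating on a shared page.

```c
#include <assert.h>

/* Toy model of a syscall page and its entry states, reconstructed
 * from the description above. All names are illustrative. */
enum entry_state { ENTRY_FREE, ENTRY_SUBMITTED, ENTRY_DONE };

struct entry { enum entry_state state; long syscall_nr; long ret; };

#define PAGE_ENTRIES 64

/* User side: claim a free entry and submit a request. The state
 * change is a plain memory store; no trap into the kernel occurs. */
int submit(struct entry page[], long syscall_nr)
{
    for (int i = 0; i < PAGE_ENTRIES; i++)
        if (page[i].state == ENTRY_FREE) {
            page[i].syscall_nr = syscall_nr;
            page[i].state = ENTRY_SUBMITTED;
            return i;
        }
    return -1; /* page full */
}

/* Kernel side (stand-in): a syscall thread would execute every
 * submitted entry in one batch and publish the return values. */
void run_batch(struct entry page[])
{
    for (int i = 0; i < PAGE_ENTRIES; i++)
        if (page[i].state == ENTRY_SUBMITTED) {
            page[i].ret = 0;           /* pretend the call succeeded */
            page[i].state = ENTRY_DONE;
        }
}
```

Note that the user thread never blocks: it submits, keeps working, and polls later for the ENTRY_DONE state, which is exactly the decoupling of invocation from execution.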
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications that contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming against an exception-less system call interface compared to the relatively simple synchronous interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law, which states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache, and memory.[1] In this sense, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest that exception-less system calls only outperformed synchronous system calls when the system was issuing many system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when many system calls can be batched before execution; in that situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications: a server must handle many user requests efficiently to be useful. Thus, for the general case, it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above it would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers exhibit a great deal of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC jumps to the next system call and executes that one instead. This can cause a problem for dependent calls, such as a read followed by a write, where the write has an outstanding dependency on the result of the read. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call; unfortunately, FlexSC currently has no handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms the authors tested, this, the simplest, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
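The probe order described above can be sketched in a few lines. The busy-set representation below is an assumption for illustration; FlexSC itself tracks core availability inside the kernel.&lt;br /&gt;

```python
# Sketch of the descending probe order: wake the highest-numbered core
# first, then fall back to X-1, X-2, ... while higher cores are busy.

def pick_syscall_core(num_cores, busy):
    """Return the highest-numbered idle core, or None if all are busy."""
    for core in range(num_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None
```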
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as in &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16]. Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered for that purpose. It was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it did not remove the requirement of data copying and did not reduce the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently (i.e., there is no need to modify application code), a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic&amp;lt;br&amp;gt;&lt;br /&gt;
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn&#039;t exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call threads assignments. It is unclear when that list is created. Is it at installation time,&amp;lt;br&amp;gt;&lt;br /&gt;
is it generated initially, or does the installer have to do any manual work. On a related note, scalability with increased cores is ambiguous. It is not that clear how scalable the&amp;lt;br&amp;gt;&lt;br /&gt;
scheduler is. One gets the impression that it is very scalable due to the fact that each core spawns a system call thread. Thus, as many threads as there are cores could be running&amp;lt;br&amp;gt;&lt;br /&gt;
concurrently, for one or more processes [1]. More explicit results however would&#039;ve been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis&amp;lt;br&amp;gt;&lt;br /&gt;
of the results. Understandable, however, it would be nice to know if these threads (2 per core) would actually be treated as a core when turned on ? I.e. would the scheduler then realize&amp;lt;br&amp;gt;&lt;br /&gt;
that it can use eight cores ? Does that also mean the predefined static cores list would need to be modified, to list eight instead of four ?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPU&#039;s use for general programming, it would&#039;ve been useful to at-least hypothesize on the possible performance&amp;lt;br&amp;gt;&lt;br /&gt;
outcome when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for&amp;lt;br&amp;gt;&lt;br /&gt;
specialized purposes ?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept that involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
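The looped multi-call idea can be sketched as follows: several calls are submitted as one unit, and each call&#039;s result feeds the next. The chaining convention below is an illustrative assumption; Cassyopia&#039;s actual mechanism is compiler-generated.&lt;br /&gt;

```python
# Sketch of a looped multi-call: run a chain of calls in one notional
# kernel crossing, feeding each result into the next call. Note the
# strictly sequential execution: no parallelism and no blocking-aware
# reordering, which is exactly how multi-calls differ from FlexSC.

def multi_call(calls, first_arg):
    """Execute calls in order; each receives the previous return value."""
    value = first_arg
    for fn in calls:
        value = fn(value)   # result of one call becomes the next argument
    return value

# e.g. a chain where each step transforms the previous result
result = multi_call([lambda v: v + 1, lambda v: v * 2], 3)  # (3+1)*2
```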
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the context of device interrupts: both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Such solutions also differ from FlexSC in two ways: they require a micro-kernel, and unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A., &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and Parkes, M., &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and Druschel, P., &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and Banga, G., &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., Wells, P. M., and Sohi, G. S., &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Teller, Patricia J., &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Computer, Volume 23, Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] Drepper, U. and Molnar, I., &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;, Tech. rep., Red Hat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] IBM developerWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, 2010. [http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6230</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6230"/>
		<updated>2010-12-02T06:26:12Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Inter-Process Interrupt */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details. It is essential to comprehend the basic terminology used in the paper in order to fully understand the ideas being discussed. The most important notions discussed in the FlexSC paper, at the core of it all, are system calls[21] and synchronous systems. These base definitions, along with numerous other helpful ideas, are explained in the sections that follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
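As a concrete example of this gateway, the short Python sketch below asks the kernel to perform the open, write, and close system calls on its behalf (on a POSIX system); user space never touches the device directly.&lt;br /&gt;

```python
# Each os.* call below traps into the kernel via a system call rather
# than manipulating the device from user space.
import os

fd = os.open("/dev/null", os.O_WRONLY)   # open(2) system call
written = os.write(fd, b"hello")         # write(2) system call
os.close(fd)                             # close(2) system call
```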
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one mode to another, specifically from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated manner of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
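The two patterns can be illustrated with a small traversal. Python hides real cache behaviour, so this only demonstrates the access order that makes spatial and temporal locality possible, not the resulting speedup.&lt;br /&gt;

```python
# Row-major traversal of a matrix touches adjacent elements in order
# (spatial locality), while the accumulator 'total' is re-read and
# re-written on every step (temporal locality).

matrix = [[r * 4 + c for c in range(4)] for r in range(4)]

total = 0       # reused every iteration: temporal locality
trace = []
for row in matrix:          # rows laid out one after another
    for value in row:       # adjacent elements: spatial locality
        total += value
        trace.append(value)

# trace visits 0, 1, 2, 3, 4, ... i.e. consecutive positions,
# the pattern a hardware cache rewards in a C-style array
```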
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, lack of locality here means that the data and instructions needed most frequently by the application continue to be evicted and reloaded (from registers and caches) due to system calls, thus contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
&amp;lt;b&amp;gt;Throughput&amp;lt;/b&amp;gt; is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A &amp;lt;b&amp;gt;store instruction&amp;lt;/b&amp;gt; refers to a typical assembly-language instruction that usually takes two arguments: a value, and a memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a sysentry is a 64-byte data structure, which includes information such as the syscall number, the number of arguments, the arguments themselves, the status, and the return value [1].&lt;br /&gt;
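A hypothetical layout for such a 64-byte sysentry can be sketched with ctypes. The field order and widths below are assumptions for illustration; the paper specifies only the contents and the total size.&lt;br /&gt;

```python
# Hypothetical 64-byte sysentry layout. The paper lists the contents
# (syscall number, number of arguments, the arguments, status, return
# value) and the 64-byte size; everything else here is an assumption.
import ctypes

class SysEntry(ctypes.Structure):
    _fields_ = [
        ("syscall_nr", ctypes.c_uint32),      # which system call
        ("nargs",      ctypes.c_uint32),      # how many args are valid
        ("status",     ctypes.c_uint32),      # free / submitted / done
        ("_pad",       ctypes.c_uint32),      # padding to align args
        ("args",       ctypes.c_uint64 * 6),  # up to six arguments; the
    ]                                         # return value could reuse a slot
```

Keeping the entry at 64 bytes means one sysentry matches a common x86 cache-line size, which suits a structure polled through shared memory.&lt;br /&gt;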
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread shares the virtual address space of its process [1].&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The &amp;lt;b&amp;gt;producer-consumer problem&amp;lt;/b&amp;gt; is a classic synchronization scenario in which one party produces items into a shared buffer while another party consumes them. FlexSC&#039;s syscall pages follow this pattern: user-mode threads produce system call requests by filling entries on the shared pages, and kernel-mode syscall threads consume and execute those requests.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides to synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC instead groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost (pollution of critical processor structures) associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may place a system call request on a free entry of a syscall page; the system call will then run once the batch condition is met, and its return value will be stored on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call yet the execution of that call may be completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
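The decoupling can be sketched with ordinary threads standing in for the user thread and the kernel-side syscall thread. Everything below (the entry format, the handler table, and the state names) is an illustrative assumption modelled on the Free/Submitted/Done states described earlier.&lt;br /&gt;

```python
# Sketch of decoupled invocation/execution: a user thread flips a shared
# entry from FREE to SUBMITTED and keeps running, while a separate
# 'syscall thread' executes the request and marks it DONE.
import threading

FREE, SUBMITTED, DONE = range(3)

entry = {"state": FREE, "call": None, "ret": None}
cond = threading.Condition()

def syscall_thread(handlers):
    """Kernel-side worker: wait for a SUBMITTED entry and execute it."""
    with cond:
        while entry["state"] != SUBMITTED:
            cond.wait()
        entry["ret"] = handlers[entry["call"]]()   # run the system call
        entry["state"] = DONE                      # publish the result
        cond.notify_all()

def submit(call):
    """User-side: post a request on the shared entry without blocking."""
    with cond:
        entry["state"], entry["call"] = SUBMITTED, call
        cond.notify_all()

def collect():
    """User-side: later, retrieve the return value and recycle the entry."""
    with cond:
        while entry["state"] != DONE:
            cond.wait()
        entry["state"] = FREE
        return entry["ret"]

worker = threading.Thread(target=syscall_thread,
                          args=({"getpid": lambda: 1234},))
worker.start()
submit("getpid")    # the user thread continues running here...
ret = collect()     # ...and picks up the result when it is ready
worker.join()
```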
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. The paper claims that this gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue building more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between potential and actual performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
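The hybrid idea suggested above could be sketched as a tiny dispatcher. This is purely illustrative and not part of FlexSC itself; the threshold value is an invented tuning knob.&lt;br /&gt;

```python
# Issue calls synchronously while the pending count is small, and switch to
# exception-less batching once enough calls accumulate to amortize the
# mode-switch cost. THRESHOLD is a hypothetical tuning parameter.

THRESHOLD = 4

def dispatch(pending_calls):
    """Return which mechanism a hybrid dispatcher would pick."""
    if len(pending_calls) < THRESHOLD:
        return "synchronous"      # batching overhead not worth it
    return "exception-less"       # batch and amortize the mode switch

print(dispatch([1]))              # synchronous
print(dispatch([1, 2, 3, 4, 5]))  # exception-less
```
&lt;br /&gt;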
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one. This can cause a problem for system call pairs such as a read followed by a write, where the write call has an outstanding dependency on the read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently support such combined calls.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
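The read-then-write dependency discussed above can be illustrated with a hypothetical combined call. All names here are invented for the sketch; FlexSC provides no such mechanism, which is exactly the gap being critiqued.&lt;br /&gt;

```python
# With purely independent exception-less calls, a write whose buffer comes
# from a pending read cannot be submitted until the read completes. A
# combined call would execute both in order inside the kernel.

storage = {"src": b"hello", "dst": b""}

def sys_read(key):
    """Toy stand-in for a read system call."""
    return storage[key]

def sys_write(key, data):
    """Toy stand-in for a write system call; returns bytes written."""
    storage[key] = data
    return len(data)

def combined_read_write(src, dst):
    """Execute two dependent calls as one submission, preserving order."""
    data = sys_read(src)
    return sys_write(dst, data)

n = combined_read_write("src", "dst")
print(n, storage["dst"])   # 5 b'hello'
```
&lt;br /&gt;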
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
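The simple fallback policy described above can be sketched as a short loop: start at the highest-numbered core and walk downward until a free one is found. The busy-set representation and function name are invented for illustration.&lt;br /&gt;

```python
def pick_syscall_core(num_cores, busy):
    """Try core X, then X-1, ..., returning the first free core or None."""
    for core in range(num_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None

print(pick_syscall_core(4, {3}))      # 2  (core 3 busy, fall back)
print(pick_syscall_core(4, set()))    # 3  (highest core is free)
print(pick_syscall_core(2, {0, 1}))   # None (all cores busy)
```
&lt;br /&gt;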
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU, acting primarily in user space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan, whose research aims to reduce the energy footprint of data centers [16]. FlexSC was considered for this purpose. It was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently (i.e. there is no need to modify application code), a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do any manual work? On a related note, scalability with increased cores is ambiguous. It is not clear how scalable the scheduler is. One gets the impression that it is very scalable, because each core spawns a system call thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be nice to know whether these hardware threads (2 per core) would actually be treated as cores when turned on. That is, would the scheduler then realize that it can use eight cores? Would the predefined static core list then need to be modified, to list eight cores instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts. Both try to limit processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC; it proposes processor modifications that allow hardware migration of threads to specialized cores. However, its authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. There are two further differences from FlexSC: such solutions require a micro-kernel, and unlike them, FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, Kernel Command using Linux System Calls IBM,2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ ]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6229</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6229"/>
		<updated>2010-12-02T06:25:25Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Syscall Threads */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details. It is essential to comprehend the basic terminology used in the paper in order to fully understand the ideas being discussed. The most important notions discussed in the FlexSC paper, at the core of it all, are system calls[21] and synchronous systems. These base definitions, along with numerous other helpful ideas, are explained in the sections to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minuscule details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving between execution modes: specifically, from user mode to kernel mode, or from kernel mode back to user mode. It does not matter which direction we are switching in; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data overwritten in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the cache (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
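A back-of-the-envelope sketch of why batching helps: each synchronous call pays one user-to-kernel round trip, while a batch of n calls shares a single round trip. The cycle cost below is an invented illustrative figure, not a measurement from the paper.&lt;br /&gt;

```python
MODE_SWITCH_CYCLES = 3000   # assumed round-trip cost (invented figure)

def sync_overhead(n_calls):
    """Every synchronous call pays a full mode-switch round trip."""
    return n_calls * MODE_SWITCH_CYCLES

def batched_overhead(n_calls, batch_size):
    """Batched calls share one round trip per batch."""
    batches = -(-n_calls // batch_size)   # ceiling division
    return batches * MODE_SWITCH_CYCLES

print(sync_overhead(32))          # 96000
print(batched_overhead(32, 8))    # 12000 -- 8x fewer mode switches
```
&lt;br /&gt;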
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, lack of locality here means that the data and instructions needed most frequently by the application are repeatedly evicted (from registers and caches) due to system calls, contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and a memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry is a 64-byte data structure which includes information such as the syscall number, the number of arguments, the arguments themselves, a status field, and the return value [1].&lt;br /&gt;
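A hedged sketch of how a 64-byte entry might be laid out. The paper states what an entry holds; the exact field widths below are assumptions chosen to total 64 bytes, not the real FlexSC layout.&lt;br /&gt;

```python
import struct

# Assumed layout: number (8 bytes) + argument count (8) + six 8-byte
# arguments (48) = 64 bytes. Status and return value are assumed to reuse
# slots once the call completes; this is an illustrative guess.
ENTRY_FMT = "<qq6q"

def pack_entry(number, args):
    """Serialize a hypothetical syscall entry into its 64-byte form."""
    padded = list(args) + [0] * (6 - len(args))
    return struct.pack(ENTRY_FMT, number, len(args), *padded)

entry = pack_entry(1, [4, 0x1000, 512])   # e.g. write(fd, buf, count)
print(len(entry))                          # 64
```
&lt;br /&gt;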
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are FlexSC&#039;s mechanism for executing exception-less system calls. A syscall thread runs in kernel mode but shares its process&#039;s virtual address space [1].&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
An inter-processor interrupt (IPI) is an interrupt sent by one processor core to another in a multiprocessor system to request an action, such as migrating a thread. Solutions that rely on IPIs to offload system calls are expensive compared to FlexSC&#039;s shared-page approach [1].&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem is a classic synchronization pattern in which one party places items into a shared buffer while another removes and processes them. In FlexSC, user-mode threads act as producers, posting requests onto syscall pages, while syscall threads act as consumers, executing the submitted entries [1].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the kernel will then execute the call once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
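The entry life cycle from point 3, with invocation decoupled from execution as in point 4, can be sketched as a tiny state machine. The transition table below is a reading of the paper&#039;s description, not its code.&lt;br /&gt;

```python
# Free -> Submitted -> Done -> Free: the user thread posts a request and
# keeps running; a syscall thread (possibly on another core) completes it.

VALID = {("free", "submitted"), ("submitted", "done"), ("done", "free")}

class Entry:
    def __init__(self):
        self.state = "free"
        self.ret = None

    def transition(self, new_state):
        if (self.state, new_state) not in VALID:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

e = Entry()
e.transition("submitted")   # user thread posts the request, keeps running
e.transition("done")        # syscall thread finishes, maybe on another core
e.transition("free")        # user thread reads e.ret; entry is reusable
print(e.state)              # free
```
&lt;br /&gt;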
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time it has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
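The hybrid dispatcher suggested above is our own speculation, not part of FlexSC; a minimal sketch of the idea, with an invented threshold and invented names, might look like this:&lt;br /&gt;

```python
# Hypothetical hybrid policy (our suggestion, not FlexSC's): below a
# threshold, issue calls synchronously; at or above it, batch them as
# exception-less calls to amortize the mode-switch cost.
THRESHOLD = 4

def dispatch(pending_calls):
    # `n in range(THRESHOLD)` is true exactly when n is below the
    # threshold (0 up to THRESHOLD - 1).
    if len(pending_calls) in range(THRESHOLD):
        return ["sync:" + c for c in pending_calls]
    return ["batched:" + c for c in pending_calls]

print(dispatch(["getpid"]))                          # synchronous path
print(dispatch(["read", "write", "open", "stat"]))   # batched path
```

The threshold itself would have to be tuned per workload, since the break-even point between per-call overhead and batching gain depends on how costly mode switches are on the given hardware.&lt;br /&gt;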
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for sequences such as a read followed by a write, where the write call has an outstanding dependency on the read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
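A hypothetical &#039;combined&#039; call of the kind suggested above (again, not something FlexSC offers; all names are invented) would let the kernel execute dependent sub-calls strictly in order, feeding each result to the next:&lt;br /&gt;

```python
# Hypothetical combined system call: the sub-calls run in order, each
# receiving the previous result, so a write that depends on a read can
# never be scheduled ahead of it. Purely illustrative.
def fake_read(fd, prev):
    return "data-from-fd%d" % fd      # stand-in for a real read()

def fake_write(fd, prev):
    return "wrote(%s)" % prev         # writes the data the read produced

def combined_call(steps):
    """steps: list of (func, fd) pairs; each func gets the previous
    step's result, preserving the dependency chain."""
    result = None
    for func, fd in steps:
        result = func(fd, result)
    return result

print(combined_call([(fake_read, 3), (fake_write, 4)]))
# wrote(data-from-fd3)
```

This mirrors the looped multi-call idea from the related work: the ordering constraint lives inside the one submitted unit, so the asynchronous scheduler never sees the dependency at all.&lt;br /&gt;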
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this, the simplest one, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
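The simple policy described above can be sketched in a few lines (a toy model; the function name and busy-set representation are ours, not FlexSC&#039;s):&lt;br /&gt;

```python
# Try the highest-numbered syscall core first (core X), then work
# downward (X-1, X-2, ...) until an idle core is found.
def pick_core(busy, num_cores):
    """Return the highest-numbered idle syscall core, or -1 if all
    candidate cores are busy."""
    for core in reversed(range(num_cores)):
        if core not in busy:
            return core
    return -1

print(pick_core(set(), 4))          # 3: core X is free
print(pick_core({3}, 4))            # 2: core X busy, fall back to X-1
print(pick_core({0, 1, 2, 3}, 4))   # -1: every core is busy
```

Its appeal is exactly its simplicity: no load statistics or history, just a fixed probe order, which is presumably why it beat the more elaborate alternatives in the single-application tests.&lt;br /&gt;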
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO ===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers and who considered FlexSC for that work. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it removed neither the requirement of data copying nor the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1]. That means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done; it is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: at installation time, generated initially, or does the installer have to do some manual work? On a related note, scalability with increased core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread and thus as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Finally, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on. Would the scheduler then realize that it can use eight cores, and would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize about the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept that involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
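The sequential, result-chaining behaviour of a looped multi-call can be sketched as follows (a toy model; the helper names are invented, and the real Cassyopia mechanism operates on actual system calls):&lt;br /&gt;

```python
# Sketch of a looped multi-call: the calls run strictly one after
# another, and each call may consume the previous call's result.
# Contrast with FlexSC, where batched calls may run in parallel.
def multi_call(calls):
    """calls: list of functions, each taking the previous result."""
    results, prev = [], None
    for call in calls:
        prev = call(prev)       # sequential: each completes before next
        results.append(prev)
    return results

# open-then-read: the "fd" produced by the first call feeds the second.
fake_open = lambda prev: 7                        # pretends to return a fd
fake_read = lambda fd: "read-from-fd%d" % fd      # consumes that fd
print(multi_call([fake_open, fake_read]))         # [7, 'read-from-fd7']
```

The whole list is submitted as one kernel crossing, so the mode-switch cost is paid once, but unlike exception-less calls there is no opportunity for parallelism inside the batch.&lt;br /&gt;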
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Soft Timers[13] and Lazy Receiver Processing[14] tackle locality of execution in the context of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Computation Spreading[15] is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it does not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. These solutions also differ from FlexSC in two respects: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] IBM DeveloperWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, 2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6226</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6226"/>
		<updated>2010-12-02T06:15:29Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Syscall Page */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on the specifics of the essay. It is essential to comprehend the basic concepts of the language spoken in the paper in order to fully understand the ideas being discussed. The most important notions discussed in the FlexSC paper, which are at the core of it all, are system calls[21] and synchronous systems. These base definitions, along with numerous other helpful ideas, can be understood through the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more important for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minute details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;mode switch&amp;lt;/b&amp;gt; is a transition from one processor mode to another, specifically from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
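The contrast between the two models above can be shown with a toy Python sketch (illustrative only; a sleep stands in for a slow kernel operation, and the names are invented):&lt;br /&gt;

```python
import threading, time

def slow_syscall():
    """Stand-in for a system call that takes a while in the kernel."""
    time.sleep(0.05)
    return "result"

# Synchronous: the caller blocks until the call returns.
t0 = time.time()
r = slow_syscall()
blocked_for = time.time() - t0    # roughly 0.05s spent blocked

# Asynchronous: control returns immediately; the result arrives later.
box = {}
worker = threading.Thread(target=lambda: box.update(r=slow_syscall()))
worker.start()                    # returns at once; caller keeps running
did_other_work = "caller ran while the call executed"
worker.join()                     # later, collect the result
print(r, box["r"], did_other_work)
```

In the synchronous case the caller is idle for the whole duration of the call; in the asynchronous case that time can be spent doing useful user-mode work, which is precisely what FlexSC exploits.&lt;br /&gt;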
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, the buffer must be rebuilt from scratch. Too many context switches will therefore cause an increase in TLB and cache misses and degrade performance.[17]&lt;br /&gt;
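The penalty described above can be illustrated with a toy translation cache (our own sketch; a real TLB is a hardware structure, and the page numbers here are invented):&lt;br /&gt;

```python
# Toy model: a context switch flushes the translation cache, so the
# next accesses miss and the translations must be rebuilt.
tlb = {}
misses = 0

def translate(vpage):
    """Map a virtual page number to a fake physical page number,
    counting misses that force a (simulated) page-table walk."""
    global misses
    if vpage not in tlb:
        misses += 1                # TLB miss: walk the page table
        tlb[vpage] = vpage + 1000  # invented physical page number
    return tlb[vpage]

for p in [1, 2, 1, 2]:
    translate(p)                   # 2 misses, then 2 hits
print("misses before switch:", misses)

tlb.clear()                        # context switch flushes the TLB
for p in [1, 2]:
    translate(p)                   # both miss again: rebuilt from scratch
print("misses after switch:", misses)
```

The same two pages that were hits before the flush become misses afterwards, which is the rebuild cost the paragraph above refers to.&lt;br /&gt;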
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Lack of locality here means that the data and instructions needed most frequently by the application are repeatedly evicted (from registers and caches) due to system calls, contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction is a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The Linux ABI is a patch to the kernel that allows SCO, Xenix, Solaris ix86, and other binaries to run on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry is a 64-byte data structure which includes information such as the syscall number, the number of arguments, the arguments themselves, a status, and the return value [1].&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
A syscall thread is a kernel-mode thread whose sole purpose is to pull submitted entries from syscall pages and execute the corresponding system calls on behalf of user-mode threads [1].&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
An inter-processor interrupt (IPI) is an interrupt sent by one processor core to another, for example to ask the target core to schedule or perform some work. Offloading system calls via IPIs is expensive, which is part of FlexSC&#039;s motivation [1].&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem is a classic synchronization problem in which producers place items into a shared buffer while consumers remove them, and the two must coordinate access to that buffer. FlexSC&#039;s syscall pages form such a relationship: user-mode threads produce syscall entries, and kernel-mode syscall threads consume and execute them [1].&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although they are easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes. System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the system call will then be executed once the batch condition is met, and its return value will be stored back on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of a syscall thread is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue running while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
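The Free/Submitted/Done life cycle of a syscall entry described in item 3 above can be sketched as a small state machine (a toy model; the dictionary layout and function names are invented, not FlexSC&#039;s actual structures):&lt;br /&gt;

```python
# Toy life cycle of one syscall entry on a syscall page:
# free (user fills a free slot), then submitted (kernel may execute),
# then done (return value stored), then free again once collected.
FREE, SUBMITTED, DONE = "free", "submitted", "done"

entry = {"state": FREE, "syscall": None, "args": None, "ret": None}

def user_submit(e, syscall, args):
    assert e["state"] == FREE
    e.update(state=SUBMITTED, syscall=syscall, args=args)

def kernel_execute(e):
    assert e["state"] == SUBMITTED
    e["ret"] = sum(e["args"])      # stand-in for the real system call
    e["state"] = DONE

def user_collect(e):
    assert e["state"] == DONE
    ret = e["ret"]
    e.update(state=FREE, syscall=None, args=None, ret=None)
    return ret

user_submit(entry, "add", [2, 3])
kernel_execute(entry)              # may happen later, on another core
print(user_collect(entry))         # 5
```

Because each transition is a plain memory update on the shared page, neither side ever needs to raise a processor exception to communicate, which is the point of the interface.&lt;br /&gt;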
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was issuing many system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
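Such a hybrid policy could be sketched as a simple cost comparison. Every constant below is invented purely for illustration; the paper does not give these numbers or this policy, it is only the threshold idea suggested above.

```python
# Hypothetical hybrid dispatch policy: model the cost of n synchronous calls
# (one mode switch each) against one batched exception-less submission
# (fixed overhead plus a small per-call cost) and pick the cheaper.
def best_mode(n, sync_cost=1.0, flexsc_overhead=3.0, flexsc_per_call=0.25):
    """Return the cheaper dispatch strategy for n pending system calls."""
    sync_total = n * sync_cost                            # n mode switches
    flexsc_total = flexsc_overhead + n * flexsc_per_call  # one batch
    return "synchronous" if sync_total <= flexsc_total else "exception-less"

print(best_mode(1))   # → synchronous
print(best_mode(8))   # → exception-less
```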
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are inherently asynchronous, if one needs to block, FlexSC jumps to the next system call and executes that one instead. This can cause a problem for dependent system calls such as a read followed by a write, where the write call depends on the result of the outstanding read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms they tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
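The scan-down heuristic is simple enough to state in a few lines. The function below is our paraphrase of it, not FlexSC's code, and the busy-set representation is an assumption.

```python
# The core-selection heuristic described above: start at the highest-numbered
# core and walk downward until a non-busy core is found for the syscall thread.
def pick_syscall_core(num_cores, busy):
    """Return the first non-busy core scanning X, X-1, ..., 0, or None."""
    for core in range(num_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None              # every core is busy; the batch must wait

print(pick_syscall_core(4, busy={3}))    # → 2
print(pick_syscall_core(4, busy=set()))  # → 3
```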
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as scientific programs, FlexSC causes more overhead than it gains in performance. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan.[16] Vasudevan&#039;s research aims to reduce the energy footprint of data centers, and FlexSC was considered in that context. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it removed neither the requirement of data copying nor the overheads associated with interrupts in IO-intensive tasks.&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper.[1]&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do manual work? On a related note, scalability with increased core counts is ambiguous. It is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread; thus as many threads as there are cores could be running concurrently, for one or more processes.[1] More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on. Would the scheduler then realize that it can use eight cores? Would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores and use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution by handling device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Such solutions also differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tal, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] DeveloperWorks, &amp;lt;i&amp;gt;Kernel Command Using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010. [http://www.ibm.com/developerworks/linux/library/l-system-calls/ ]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6225</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6225"/>
		<updated>2010-12-02T06:14:55Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Syscall Page */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The title of the paper we will be analyzing is &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details. It is essential to comprehend the basic vocabulary used in the paper in order to fully understand the ideas being discussed. The most important notions in the FlexSC paper, at the core of it all, are system calls[21] and synchronous systems. These base definitions, along with numerous other helpful ideas, are explained in the section to follow. &lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. It is more vital for the reader to understand the core ideas of these definitions, along with the underlying motivation for their existence, than to understand the minuscule details of their processes. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from user-space mode to kernel mode, or from kernel mode back to user space. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated manner of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
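Both effects can be seen in a toy LRU cache model. The 64-byte line size and 8-line capacity below are illustrative choices, not measurements from the paper.

```python
# Illustrative cache model: consecutive addresses hit the same cache line
# (spatial locality), and re-accessing recent addresses hits lines that are
# already cached (temporal locality).
LINE = 64  # bytes per cache line (a common size; purely illustrative)

def hit_rate(addresses, cache_lines=8):
    cached, hits = [], 0                  # list ordered least- to most-recent
    for addr in addresses:
        line = addr // LINE
        if line in cached:
            hits += 1
            cached.remove(line)           # refresh: move to most-recent slot
        elif len(cached) == cache_lines:
            cached.pop(0)                 # evict the least-recently-used line
        cached.append(line)
    return hits / len(addresses)

sequential = list(range(0, 4096, 8))      # spatial: walks each line 8 times
repeated = [0, 64, 128] * 100             # temporal: same 3 lines repeatedly
print(hit_rate(sequential) > 0.8, hit_rate(repeated) > 0.9)  # → True True
```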
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
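As a worked example of the arithmetic (numbers invented): IPC is simply retired instructions divided by elapsed cycles, which is why the cache and TLB pollution discussed in the paper shows up as a lower measured IPC.

```python
# IPC = retired instructions / elapsed cycles (illustrative numbers only).
def ipc(instructions, cycles):
    return instructions / cycles

print(ipc(1_000_000, 800_000))  # → 1.25
```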
&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
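A toy model makes this cost concrete: flushing mid-run forces every page to be re-walked once before it can hit again, doubling the misses for the illustrative workload below.

```python
# Toy model of why flushing the TLB on a context switch hurts performance.
def tlb_misses(pages, flush_points=()):
    tlb, misses = set(), 0
    for i, page in enumerate(pages):
        if i in flush_points:
            tlb.clear()            # context switch: the whole TLB is flushed
        if page not in tlb:
            misses += 1            # a miss forces a page-table walk
            tlb.add(page)
    return misses

workload = [0, 1, 2, 3] * 50       # 200 accesses over 4 pages
print(tlb_misses(workload))                      # → 4
print(tlb_misses(workload, flush_points={100}))  # → 8
```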
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. Thus, lack of locality here means that the data and instructions needed most frequently by the application continue to be displaced (from registers and caches) by system calls, hence contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction which usually takes two arguments: a value, and a memory location where that value should be stored. In FlexSC, syscall requests are issued with regular store instructions to the shared syscall page rather than with exception-raising instructions.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a collection of syscall entries. In turn, a syscall entry is a 64-byte data structure which includes information such as the syscall number, the number of arguments, the arguments themselves, a status field, and the return value.&lt;br /&gt;
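One plausible 64-byte packing can be sketched with Python's struct module. The field order and widths below are assumptions for illustration (the paper describes the contents, not this exact layout), and writing the return value back over an argument slot once the entry is Done is likewise an assumed convention.

```python
# Hypothetical 64-byte syscall entry layout (illustrative, not FlexSC's):
# 8-byte syscall number, 4-byte status, 4-byte argument count, six 8-byte
# argument slots. The return value could reuse the first argument slot.
import struct

ENTRY = struct.Struct("<qii6q")   # little-endian: q=8 bytes, i=4 bytes

blob = ENTRY.pack(1, 0, 3,        # syscall no. 1, status Free, 3 args
                  10, 20, 30,     # the three arguments
                  0, 0, 0)        # unused argument slots
print(ENTRY.size, len(blob))      # → 64 64
```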
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are kernel-only threads whose sole purpose is to pull submitted requests from syscall pages and execute them in kernel mode, allowing the invoking user-mode thread to continue running.&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
An inter-processor interrupt (IPI) is an interrupt sent by one processor core to another to request an action, such as a thread migration. IPIs are expensive, which is why FlexSC avoids relying on them to offload system calls.&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem describes two parties sharing a fixed-size buffer: producers add items and must wait when the buffer is full, while consumers remove items and must wait when it is empty. In FlexSC, user-mode threads act as producers of system call requests, syscall threads act as consumers, and the syscall page plays the role of the shared buffer.&lt;br /&gt;
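The pattern can be sketched with Python's queue module: user threads produce requests into a bounded buffer (the analogue of a syscall page's fixed number of entries) and a consumer thread drains them. The names here are illustrative.

```python
# Producer-consumer in miniature: a user-mode "producer" posts syscall
# requests into a bounded shared buffer; a "syscall thread" consumes them.
import queue, threading

buffer = queue.Queue(maxsize=2)   # bounded, like the fixed entries of a page
consumed = []

def consumer():
    while True:
        item = buffer.get()       # blocks while the buffer is empty
        if item is None:          # shutdown sentinel
            break
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for req in ["open", "read", "write"]:
    buffer.put(req)               # blocks if the buffer is full
buffer.put(None)
t.join()
print(consumed)  # → ['open', 'read', 'write']
```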
&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the ease of programming with sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to adopt.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel versus the cores shared between kernel and user the way FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism that separates system call execution from system call invocation, a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is invoked, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost associated with switching modes: pollution of critical processor structures. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
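The batching idea reduces to: accumulate requests, switch once, execute all. A minimal sketch, in which a handler table stands in for the kernel's syscall dispatch (our simplification, not FlexSC's code):

```python
# System call batching in miniature: one user->kernel transition executes
# every submitted entry on the page, instead of one transition per call.
def run_batch(page, handlers):
    """Execute every (name, args) entry with a single mode switch."""
    mode_switches = 1                  # the lone user->kernel transition
    results = [handlers[name](*args) for name, args in page]
    return results, mode_switches

page = [("add", (1, 2)), ("mul", (3, 4)), ("add", (5, 6))]
handlers = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
print(run_batch(page, handlers))  # → ([3, 12, 11], 1)
```

With synchronous calls the same three requests would have cost three mode switches; here the count stays at one regardless of batch size.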
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; once the batch condition is met, the kernel executes the call and stores the return value back in the entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and each entry is in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - the kernel can proceed to invoke the appropriate system call; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - the kernel has finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. The paper claims that this gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this light, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue building more powerful machines if the code we run will not speed up (become more efficient) along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was issuing many system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls that can be batched before execution; in this situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and runs primarily in user space, as in &#039;Scientific Programs&#039;, FlexSC adds more overhead than it gains in performance. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered in that work; it was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines, and then recompile the kernel, after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done: it is mentioned that decisions are made based on the workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: at installation time, generated initially, or does the installer have to do any manual work? On a related note, scalability with an increasing number of cores is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. Understandable; however, it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on. Would the scheduler then realize that it can use eight cores? Would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, that work did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and unlike FlexSC they cannot dynamically adapt the proportion of cores used exclusively by the kernel versus cores shared by user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of these non-blocking proposals and FlexSC is that none of them decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[21] IBM developerWorks, &amp;lt;i&amp;gt;Kernel Command using Linux System Calls&amp;lt;/i&amp;gt;, IBM, 2010.[http://www.ibm.com/developerworks/linux/library/l-system-calls/ HTML]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6160</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6160"/>
		<updated>2010-12-02T03:58:07Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Regular Store Instructions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence system calls are the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to transitions between processor modes, specifically from user mode to kernel mode or from kernel mode back to user mode; the term is general and applies in either direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls have mostly been synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to the wasteful or unnecessary delay caused by system calls. This pollution stems directly from the fact that a system call invokes a mode switch, which is not a cost-free operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer - a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions including undefined instructions and software interrupts(system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I did some of these, some I don&#039;t think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed; when the process resumes, the buffer must be rebuilt from scratch. Too many context switches will therefore cause an increase in TLB misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality here refers to both types of locality, i.e. temporal and spatial, defined above. A lack of locality thus means that the data and instructions the application needs most frequently keep being evicted and reloaded (in registers and caches) because of system calls, contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
A store instruction refers to a typical assembly-language instruction that usually takes two arguments: a value, and the memory location where that value should be stored.&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86, and other binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A &amp;lt;b&amp;gt;syscall page&amp;lt;/b&amp;gt; is a memory page shared between user mode and kernel mode that holds a table of syscall entries, through which user-mode threads submit system call requests and later retrieve their return values (see the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section below).[1]&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&amp;lt;b&amp;gt;Syscall threads&amp;lt;/b&amp;gt; are kernel-mode threads whose sole purpose is to pull submitted requests from syscall pages and execute them, decoupling system call execution from invocation (see the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section below).[1]&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
An inter-processor interrupt (IPI) is an interrupt that one processor core sends to another, for example to request that the other core perform some action on its behalf; IPIs are a relatively expensive cross-core communication mechanism.&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem is the classic synchronization problem in which producer threads place items into a shared, bounded buffer and consumer threads remove them.[2] In FlexSC, user-mode threads act as producers of system call requests on the syscall pages, while kernel-mode syscall threads act as consumers that execute those requests.&lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use comes with undesirable side effects which can reduce the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to use for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], on &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and on &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel versus the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, the state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is invoked, FlexSC groups system calls together into batches. Each batch can then be executed at one time, minimizing the frequency of mode switches between user and kernel modes. Batching helps with both the direct cost of mode switching and the indirect cost associated with it, the pollution of critical processor structures. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;: a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; once the batch condition is met, the kernel executes the request and stores the return value back in the same entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and each entry may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution is completed on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
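The four mechanisms above can be illustrated with a toy Python simulation. This is emphatically not the paper's kernel implementation; every class, name, and the mode-switch counter here are illustrative assumptions, sketching only the state machine (Free/Submitted/Done) and the idea that one syscall-thread pass drains a whole batch with a single mode switch:

```python
# Toy sketch of exception-less system calls: entry states, batching, and a
# syscall-thread loop. All names here are illustrative, not from the paper.
from enum import Enum

class State(Enum):
    FREE = 0       # entry available for a new request
    SUBMITTED = 1  # user thread wrote a request; kernel may execute it
    DONE = 2       # kernel finished; return value ready for pickup

class SyscallEntry:
    def __init__(self):
        self.state = State.FREE
        self.call = None   # the requested operation (a callable, for the toy)
        self.ret = None    # return value once the kernel side is done

class SyscallPage:
    def __init__(self, n_entries=8):
        self.entries = [SyscallEntry() for _ in range(n_entries)]
        self.mode_switches = 0

    def submit(self, call):
        """User side: write a request into a free entry (raises no exception
        in the processor sense -- just shared-memory writes)."""
        for e in self.entries:
            if e.state is State.FREE:
                e.state, e.call = State.SUBMITTED, call
                return e
        raise RuntimeError("no free syscall entry")

    def run_syscall_thread(self):
        """Kernel side: one batch drains every submitted entry with a single
        simulated user/kernel mode switch instead of one switch per call."""
        self.mode_switches += 1
        for e in self.entries:
            if e.state is State.SUBMITTED:
                e.ret = e.call()      # execute the request "in kernel mode"
                e.state = State.DONE

page = SyscallPage()
handles = [page.submit(lambda i=i: i * i) for i in range(5)]
page.run_syscall_thread()             # one batched switch covers five calls
results = [h.ret for h in handles]
print(results, page.mode_switches)    # [0, 1, 4, 9, 16] 1
```

Five requests cost one simulated mode switch here, where synchronous calls would cost five; that ratio is the whole point of batching.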
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law, which states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. The paper claims that this gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this light, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to keep building more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our actual performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was issuing many system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the paper focuses on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
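The hybrid policy suggested above can be sketched in a few lines. This is our speculation, not anything from the paper; the threshold value and every name are assumptions, and the two branches differ only in which dispatch strategy a real system would use:

```python
# Hedged sketch of a hybrid dispatch policy: use direct (synchronous) calls
# when few requests are pending, batched exception-less calls otherwise.
# BATCH_THRESHOLD is illustrative; a real system would tune it empirically.
BATCH_THRESHOLD = 4

def dispatch(pending_calls):
    """Return (strategy, results); strategy is 'sync' below the threshold
    (one mode switch per call) and 'batched' at or above it (one switch
    for the whole group)."""
    if len(pending_calls) < BATCH_THRESHOLD:
        return ("sync", [c() for c in pending_calls])
    return ("batched", [c() for c in pending_calls])

print(dispatch([lambda: 1, lambda: 2])[0])            # sync
print(dispatch([(lambda i=i: i) for i in range(5)]))  # ('batched', [0, 1, 2, 3, 4])
```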
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
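The walk-down policy described above is simple enough to sketch directly. The function name and the busy-set representation are our assumptions, not FlexSC internals:

```python
# Minimal sketch of the core-selection policy described above: try the
# highest-numbered core first, then walk downward until a free one is found.
def pick_syscall_core(n_cores, busy):
    """Return the first free core counting down from n_cores-1, or None
    if every core is busy (the batch would then have to wait)."""
    for core in range(n_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None

print(pick_syscall_core(4, set()))         # 3
print(pick_syscall_core(4, {3}))           # 2
print(pick_syscall_core(4, {0, 1, 2, 3}))  # None
```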
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as &#039;scientific programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered, and it was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it removed neither the requirement of data copying nor the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
This means adopters would have to add or modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done; the paper states that decisions are made based on workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments, but it is unclear when that list is created: at installation time, generated initially, or does the installer have to do manual work? On a related note, scalability with increased core counts is ambiguous. One gets the impression that the scheduler is very scalable, since each core spawns a system call thread; thus as many threads as there are cores could run concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be useful to know whether these hardware threads (two per core) would be treated as cores when turned on. Would the scheduler then see eight cores, and would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize about the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors.[11] The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[6] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
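The looped multi-call idea above can be shown with a toy chain. This is not Cassyopia's actual interface, only a sketch of the data flow it describes: one submission runs a whole sequence, with each result feeding the next call:

```python
# Toy version of a looped multi-call: the batch is executed sequentially
# (unlike FlexSC), and the result of call i becomes the argument of call i+1.
# Function names and the use of plain callables are illustrative assumptions.
def looped_multicall(calls, initial_arg):
    """One simulated kernel crossing runs the whole chain of calls."""
    arg = initial_arg
    for call in calls:
        arg = call(arg)   # result of one call feeds the next
    return arg

# e.g. an open -> read -> close chain, modeled here as simple arithmetic steps
chain = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(looped_multicall(chain, 5))  # ((5 + 1) * 2) - 3 = 9
```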
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the context of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, whereas FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6125</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6125"/>
		<updated>2010-12-02T03:22:19Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Lack of Locality */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details on the specifics of the essay.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one processor mode to another: specifically, from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
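The submit-and-continue pattern defined above can be illustrated with Python&#039;s standard thread pool. This is an analogy, not FlexSC or a kernel interface: the point is only that invocation returns immediately while the work completes elsewhere, and the result is collected later:

```python
# Illustration of asynchronous invocation: the caller regains control
# immediately and picks up the result later. The slow_io function is a
# made-up stand-in for a blocking kernel operation.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_io():
    time.sleep(0.05)   # pretend this is a blocking read
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_io)       # invocation returns immediately
    other_work = sum(range(10))         # the caller keeps running meanwhile
    print(other_work, future.result())  # 45 42
```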
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay caused by system calls. This pollution stems directly from the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data overwritten in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
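Temporal locality can be made concrete with a toy LRU "cache": a trace that reuses a few addresses gets far more hits than a scattered trace. The cache size and traces below are made up for illustration:

```python
# Toy demonstration of temporal locality with a small LRU cache model.
from collections import OrderedDict

def hit_rate(trace, cache_size=4):
    """Fraction of accesses that hit a cache of cache_size entries,
    evicting the least recently used entry on a miss when full."""
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)

local = [0, 1, 2, 3] * 25      # strong temporal locality: 4 addresses reused
scattered = list(range(100))   # no reuse at all
print(hit_rate(local), hit_rate(scattered))  # 0.96 0.0
```

The locality-rich trace misses only on its first four accesses; the scattered trace never hits, which is exactly the behavior mode switches aggravate when they flush these structures.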
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
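As a quick worked example of the definition above (the counts are invented, not measurements from the paper):

```python
# IPC = instructions retired / clock cycles elapsed; the numbers are made up.
instructions = 1_200_000
cycles = 800_000
ipc = instructions / cycles
print(ipc)  # 1.5
```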
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I did some of these, some I don&#039;t think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
As per the paper, locality refers to both types of locality defined above, i.e. temporal and spatial. A lack of locality here means that the data and instructions needed most frequently by the application continue to be switched back and forth (between registers and caches) due to system calls, hence contributing to performance degradation.&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls neither make use of parallel execution of system calls nor manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel versus cores shared between kernel and user the way FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions; FlexSC, in contrast, provides a mechanism to separate system call execution from system call invocation, which is a key difference from previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost (pollution of critical processor structures) associated with switching modes. System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may place a system call request in a free entry of a syscall page; once the batch condition is met, the kernel executes the request and stores the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, each of which may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
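The mechanism described in points 1-4 can be sketched as a small simulation. This is a minimal sketch only: the entry fields, page size, and batch condition below are illustrative assumptions, not the paper&#039;s actual layout.&lt;br /&gt;

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch of a FlexSC-style syscall page. Field names, the
# page size, and the batch condition are illustrative assumptions.

class State(Enum):
    FREE = 0       # entry may receive a new request
    SUBMITTED = 1  # kernel may execute the call
    DONE = 2       # return value is ready for the user thread

@dataclass
class Entry:
    state: State = State.FREE
    syscall_nr: int = 0
    args: tuple = ()
    ret: int = 0

class SyscallPage:
    def __init__(self, n_entries=64):
        self.entries = [Entry() for _ in range(n_entries)]

    def submit(self, nr, args):
        """User side: claim a free entry; no processor exception is raised."""
        for i, e in enumerate(self.entries):
            if e.state == State.FREE:
                e.syscall_nr, e.args = nr, args
                e.state = State.SUBMITTED
                return i
        return -1  # page full: the batch condition is met

    def drain(self, do_syscall):
        """Kernel side (syscall thread): execute the whole batch in one pass."""
        for e in self.entries:
            if e.state == State.SUBMITTED:
                e.ret = do_syscall(e.syscall_nr, e.args)
                e.state = State.DONE
```

Here submit() plays the role of the user-mode side, while drain() plays the role of a kernel-mode syscall thread working through the whole batch after a single mode switch.&lt;br /&gt;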
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue building more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was issuing many system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for system calls such as reads and writes, where a write call has an outstanding dependency on a read call. This could be resolved with some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of those cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, as in &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO===&lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as realized by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers and who considered FlexSC for that purpose. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it neither removed the requirement of data copying nor reduced the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to re-apply the referenced lines and recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done; it is mentioned that decisions are made based on workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it made at installation time, generated initially, or does the installer have to do manual work? On a related note, scalability with increasing core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be nice to know whether those hardware threads (2 per core) would be treated as cores when turned on, i.e. would the scheduler then realize that it can use eight cores? Does that also mean the predefined static core list would need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors.[11] The Cassyopia compiler implements a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[6] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. The system calls in a multi-call are executed sequentially; each must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
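The strictly sequential nature of a multi-call, including the looped variant attributed to Cassyopia, can be illustrated with a small sketch; the names here are illustrative, not Cassyopia&#039;s actual interface.&lt;br /&gt;

```python
# Sketch of a multi-call: several requests submitted as one batch and
# executed strictly in order, each completing before the next begins
# (contrast with FlexSC, which may run calls in parallel and move past
# blocked ones). Names are illustrative.

def multicall(requests, do_one):
    """requests: list of (syscall_nr, args) pairs; after one mode switch,
    run each call in order."""
    results = []
    for nr, args in requests:  # sequential: call i+1 waits for call i
        results.append(do_one(nr, args))
    return results

def looped_multicall(requests, do_one, seed):
    """Looped variant: the result of each call feeds the next call."""
    value = seed
    for nr in requests:
        value = do_one(nr, value)
    return value
```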
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Such solutions also differ from FlexSC in two ways: they require a micro-kernel, whereas FlexSC can dynamically adapt the proportion of cores dedicated to the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6047</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6047"/>
		<updated>2010-12-02T00:53:46Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within the paper. Here listed below, are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence system calls act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to transitions between processor modes, specifically moving from user mode to kernel mode or from kernel mode back to user mode. The direction of the switch does not matter; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;: the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the system completes one system call at a time and does not move on to the next until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
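As a sketch of these semantics only (this thread-based analogy is not FlexSC&#039;s mechanism), an asynchronous call can be modelled with a background worker: the invocation returns immediately, the caller keeps running, and the result is collected later.&lt;br /&gt;

```python
import threading

# Thread-based sketch of asynchronous call semantics: invocation does
# not block, and the result is retrieved afterwards. Illustrative only.

def async_invoke(fn, arg):
    """Start the work in the background; return a handle immediately."""
    slot = {}
    def runner():
        slot["result"] = fn(arg)  # the "system call" body runs here
    t = threading.Thread(target=runner)
    t.start()
    return t, slot

def async_wait(handle):
    """Collect the result later; blocks only if the work is unfinished."""
    t, slot = handle
    t.join()
    return slot["result"]
```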
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I did some of these, some I don&#039;t think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
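A toy model makes this flush penalty concrete. The structure below is purely illustrative, not a model of any real TLB: after a context switch the translation cache is empty, so accesses that previously hit now miss and must be refilled from the page table.&lt;br /&gt;

```python
# Toy model of a TLB: a small virtual-to-physical translation cache
# that is flushed on a context switch, after which previously cached
# translations all miss again. Illustrative only.

class TLB:
    def __init__(self):
        self.map = {}
        self.hits = 0
        self.misses = 0

    def translate(self, vpage, page_table):
        """Return the physical page for vpage, counting hits and misses."""
        if vpage in self.map:
            self.hits += 1
        else:
            self.misses += 1
            self.map[vpage] = page_table[vpage]  # refill from page table
        return self.map[vpage]

    def context_switch(self):
        self.map.clear()  # entire buffer flushed; must be rebuilt
```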
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls is the ease of programming that comes with sequential operation. However, this ease of use comes with undesirable side effects which can reduce the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel versus cores shared between the kernel and user the way FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, the state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of running each system call as soon as it is invoked, FlexSC groups system calls into batches. Each batch can then be executed at once, minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost associated with it, the pollution of critical processor structures.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;: a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; once the batch condition is met, the kernel executes the call and stores the return value in the same entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor reading the return value from it generates a processor exception. Each syscall page is a table of syscall entries, each of which may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate system call invocation from system call execution, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of a syscall thread is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
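The entry lifecycle and the decoupling described above can be sketched as a small simulation. This is a minimal illustrative sketch, not the paper's implementation: the table size, entry format, and handler functions are assumptions made for the example.

```python
# Minimal simulation of a FlexSC-style syscall page (illustrative only).
# States follow the paper's Free / Submitted / Done entry lifecycle;
# everything else (table size, request format) is invented for the sketch.
FREE, SUBMITTED, DONE = "free", "submitted", "done"

def make_syscall_page(entries=8):
    return [{"state": FREE, "call": None, "args": None, "ret": None}
            for _ in range(entries)]

def submit(page, call, args):
    """User mode: post a request in a free entry; no exception is raised."""
    for i, e in enumerate(page):
        if e["state"] == FREE:
            e.update(state=SUBMITTED, call=call, args=args, ret=None)
            return i
    return None  # page full; a real system would wait or use another page

def syscall_thread(page, handlers):
    """Kernel mode: a syscall thread drains all Submitted entries in a batch."""
    for e in page:
        if e["state"] == SUBMITTED:
            e["ret"] = handlers[e["call"]](*e["args"])
            e["state"] = DONE

def collect(page, i):
    """User mode: retrieve the return value and free the entry."""
    e = page[i]
    if e["state"] == DONE:
        ret, e["state"] = e["ret"], FREE
        return ret
    return None  # still pending

page = make_syscall_page()
handlers = {"add": lambda a, b: a + b}   # stand-in for real kernel work
i = submit(page, "add", (2, 3))          # invocation: returns immediately
syscall_thread(page, handlers)           # execution: possibly another core
print(collect(page, i))                  # prints 5
```

Note how the submitting side never traps into the kernel: the only communication is through the shared table, which is what removes the processor exception from the call path.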
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications, which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
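The multiplexing described above can be sketched as a cooperative scheduler: when a user-mode thread posts a syscall and would logically block, the library simply switches to another ready thread, and whole batches execute between switches. This is a hypothetical sketch, not the real FlexSC-Threads library; the generator-based thread representation and the scheduler loop are invented for illustration.

```python
# Illustrative sketch (not the real FlexSC-Threads library) of M-on-N
# multiplexing: each user thread is a generator, and yielding models posting
# a syscall and resuming only once its result is delivered.
def scheduler(threads, kernel):
    """Round-robin the user threads; run each until it posts a syscall,
    then execute the whole batch and resume everyone with their results."""
    ready = [(t, None) for t in threads]
    trace = []
    while ready:
        waiting = []
        for t, value in ready:
            try:
                req = t.send(value)        # run until the next "syscall"
                waiting.append((t, req))
            except StopIteration:
                pass                       # thread finished
        trace.extend(req for _, req in waiting)
        # batch executes "in kernel mode"; results wake the blocked threads
        ready = [(t, kernel(req)) for t, req in waiting]
    return trace

def worker(name, n):
    total = 0
    for i in range(n):
        total = total + (yield (name, i))  # post a syscall, await its result
    return total

trace = scheduler([worker("a", 2), worker("b", 2)], lambda req: req[1] * 2)
print(trace)   # calls from thread a and thread b interleave batch by batch
```

The trace shows the event-driven flavour the paper mentions: requests from different user threads interleave, so nothing about the ordering is sequential from a single thread's perspective.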
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to keep building more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls that can be batched before execution; in this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
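A hybrid policy of the kind suggested above could be sketched as a simple dispatcher: below a batch-size threshold, issue calls synchronously; at or above it, batch them. The threshold value, the counter, and the executor functions are hypothetical, invented only to make the trade-off concrete.

```python
# Hypothetical hybrid dispatcher (not from the paper): issue calls
# synchronously below a batch-size threshold, batch them at or above it.
# THRESHOLD and the executor functions are invented for the sketch.
THRESHOLD = 4

mode_switches = {"count": 0}

def sync_exec(call):
    mode_switches["count"] += 1        # one user/kernel switch per call
    return call * 2                    # stand-in for real kernel work

def batch_exec(calls):
    mode_switches["count"] += 1        # one switch covers the whole batch
    return [call * 2 for call in calls]

def dispatch(calls, sync_exec, batch_exec):
    """Pick the cheaper mechanism based on how many calls are pending."""
    if THRESHOLD > len(calls):
        return [sync_exec(c) for c in calls]
    return batch_exec(calls)

print(dispatch([1, 2], sync_exec, batch_exec), mode_switches["count"])
print(dispatch([1, 2, 3, 4], sync_exec, batch_exec), mode_switches["count"])
```

With two pending calls the dispatcher pays two mode switches; the batch of four pays only one more, which is the asymmetry the critique points at.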
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC jumps to the next system call and executes that one instead. This can cause a problem for system calls such as reads and writes, where the write call has an outstanding dependency on the read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it then tries core X-1, and so on. Of all the algorithms the authors tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
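The walk-down policy described, start at the highest-numbered syscall core and step down until a free one is found, can be sketched as follows. The core count and the busy-tracking list are assumptions made for the example, not details from the paper.

```python
# Sketch of the simple core-selection policy described above: try the
# highest-numbered syscall core first, then walk down until one is free.
# The busy list is an invented stand-in for real per-core state.
def pick_syscall_core(busy):
    """busy: list indexed by core id, True if that core is running a batch."""
    for core in reversed(range(len(busy))):
        if not busy[core]:
            return core
    return None  # every syscall core is occupied

busy = [False, False, False, True]   # core 3 (core "X") is still busy
print(pick_syscall_core(busy))       # prints 2: the next core down
```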
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and runs primarily in user space, as in &#039;scientific programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered, and it was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it removed neither the requirement of data copying nor the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler attempts to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done: the paper says decisions are made based on workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do manual work? On a related note, scalability with increased core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on, i.e. would the scheduler then see eight cores? Would the predefined static core list then need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores and use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept that involves collecting multiple system calls and submitting them as a single system call; they are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can execute in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
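The looped multi-call idea, feeding one call's result into the next within a single submission, can be sketched like this. The call table and the ("PREV", i) reference format are invented for illustration; Cassyopia's actual mechanism is a compiler transformation, not this toy interpreter.

```python
# Illustrative looped multi-call: several calls submitted as one unit, where
# a later call may reference an earlier call's result by index ("PREV", i).
# The call table and argument format are invented for the sketch.
def multi_call(table, calls):
    """Execute calls sequentially in one 'kernel entry', resolving references
    to previous results before each call runs."""
    results = []
    for name, args in calls:
        resolved = [results[a[1]] if isinstance(a, tuple) and a[0] == "PREV"
                    else a
                    for a in args]
        results.append(table[name](*resolved))
    return results

table = {"open": lambda path: 42,            # pretend file descriptor
         "read": lambda fd, n: "x" * n}      # pretend file data
out = multi_call(table, [("open", ["/tmp/f"]),
                         ("read", [("PREV", 0), 3])])
print(out)   # [42, 'xxx']: the read consumed the open's result
```

Because the whole chain executes inside one submission, only a single mode switch is paid, but, as the text notes, the calls still run strictly one after another.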
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], try to tackle locality of execution by handling device interrupts: both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC; it proposes processor modifications that allow hardware migration of threads to specialized cores. However, its authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike FlexSC, they cannot dynamically adapt the proportion of cores used by the kernel versus cores shared by user and kernel execution. While all of these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Poceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/ HTML] and Linux application page. [http://www.linux.org/apps/AppId_8088.html HTML]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf HTML]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6045</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6045"/>
		<updated>2010-12-02T00:51:02Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between User Space and Kernel Space. User Space is not given direct access to the Kernel&#039;s services for several reasons (one being security); hence, system calls are the messengers between User Space and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; refers to moving from one execution mode to another, specifically from User Space mode to Kernel mode or from Kernel mode back to User Space mode. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process that initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay in the system caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
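Both forms of locality can be made concrete with a toy direct-mapped cache model: repeated accesses (temporal) and neighbouring addresses on the same line (spatial) both hit. The line size and the unbounded cache are arbitrary simplifications for the sketch.

```python
# Toy cache model: one cache "line" covers LINE consecutive addresses.
# Temporal locality: re-touching an address hits. Spatial locality:
# touching a neighbour on the same line also hits. Sizes are arbitrary,
# and the cache never evicts, which real caches of course do.
LINE = 4

def run(accesses):
    cache, hits, misses = set(), 0, 0
    for addr in accesses:
        line = addr // LINE
        if line in cache:
            hits += 1
        else:
            misses += 1
            cache.add(line)
    return hits, misses

print(run([0, 1, 2, 3]))      # spatial: one miss loads the line, rest hit
print(run([100, 100, 100]))   # temporal: first access misses, repeats hit
print(run([0, 64, 128]))      # no locality: every access misses
```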
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I did some of these, some I don&#039;t think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
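The penalty described above can be illustrated with a toy TLB model: each context switch flushes the buffer, so the same working set must be re-translated (via page-table walks) afterwards. The class shape, sizes, and miss accounting are invented for the sketch.

```python
# Toy TLB model: maps virtual page numbers to (pretend) physical frames.
# A context switch flushes everything, so the next accesses all miss again.
# This is an invented illustration, not a model of any real MMU.
class TLB:
    def __init__(self):
        self.entries, self.misses = {}, 0

    def translate(self, vpn):
        if vpn not in self.entries:
            self.misses += 1                  # page-table walk needed
            self.entries[vpn] = ("frame", vpn)
        return self.entries[vpn]

    def context_switch(self):
        self.entries = {}                     # full flush on switch

tlb = TLB()
for vpn in [1, 2, 3, 1, 2, 3]:
    tlb.translate(vpn)
print(tlb.misses)          # 3: the working set is cached after the first pass
tlb.context_switch()
for vpn in [1, 2, 3]:
    tlb.translate(vpn)
print(tlb.misses)          # 6: the flush forced every translation to miss
```

This is exactly why frequent mode and context switches degrade performance, and why FlexSC's batching, which amortizes switches, helps.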
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls is the easy-to-program nature of sequential operation. However, this ease of use comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to adopt.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls neither make use of parallel execution of system calls nor manage the blocking aspect of synchronous system calls; FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, perhaps most crucially, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of running each system call as soon as it is issued, FlexSC groups system calls into batches. Each batch can then be executed at one time, minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;: a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request system calls from kernel-mode procedures. A user-mode thread posts a system call request in a free entry of a syscall page; once the batch condition is met, the request is executed and its return value is stored back in the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, and each entry is in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel has finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
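The four mechanisms above can be illustrated with a small single-process simulation. This is a sketch only: the entry layout, function names, and the stand-in work (doubling a number) are invented here, not taken from the FlexSC implementation.&lt;br /&gt;

```c
#include <assert.h>

enum entry_state { FREE, SUBMITTED, DONE };

struct syscall_entry {
    enum entry_state state;
    int syscall_no;   /* which system call is requested */
    long arg;         /* single argument, for simplicity */
    long ret;         /* return value filled in by the "kernel" side */
};

#define PAGE_ENTRIES 8
static struct syscall_entry page[PAGE_ENTRIES];   /* the shared "syscall page" */

/* User side: write a request into a Free entry with ordinary stores;
 * no processor exception is raised at this point. */
static int submit(int syscall_no, long arg)
{
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (page[i].state == FREE) {
            page[i].syscall_no = syscall_no;
            page[i].arg = arg;
            page[i].state = SUBMITTED;
            return i;
        }
    }
    return -1;  /* page full: the batch should be flushed first */
}

/* "Kernel" side: a syscall thread drains every Submitted entry in one
 * pass -- the batching step that amortizes a single mode switch. */
static void run_batch(void)
{
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (page[i].state == SUBMITTED) {
            page[i].ret = page[i].arg * 2;  /* stand-in for real kernel work */
            page[i].state = DONE;
        }
    }
}

/* User side: collect the result of a Done entry and mark it Free again. */
static long reap(int idx)
{
    long r = page[idx].ret;
    page[idx].state = FREE;
    return r;
}
```

In the real system the submit side and the run_batch side execute on different threads, and potentially on different cores, with the syscall page as the only shared state.&lt;br /&gt;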
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time it has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
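The hybrid suggested above could be sketched as a simple dispatch predicate. The threshold value and all names below are hypothetical, not taken from the paper.&lt;br /&gt;

```c
#include <assert.h>

/* Hypothetical tuning knob: below this many pending calls, the batching
 * overhead exceeds the mode-switch savings, so go synchronous. */
#define BATCH_THRESHOLD 4

/* Returns 1 when a request should be queued for exception-less
 * batching, 0 when a direct synchronous call is cheaper (light load). */
static int should_batch(int pending_calls)
{
    return pending_calls >= BATCH_THRESHOLD;
}
```

Under light load every request takes the synchronous path; once enough independent requests accumulate, the exception-less path amortizes the switch cost across the batch.&lt;br /&gt;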
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it will try core X-1, and so on. Of all the algorithms tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; the scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
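The policy described, try the highest-numbered core first and walk downward, can be sketched as follows; the core count and naming are illustrative only.&lt;br /&gt;

```c
#include <assert.h>

/* Pick a core for the next syscall thread: start at the highest-numbered
 * core and walk down, taking the first idle one -- the simple policy the
 * paper found most efficient for a single application.
 * busy[i] != 0 means core i is already running a syscall thread. */
static int pick_syscall_core(const int busy[], int ncores)
{
    for (int i = ncores - 1; i >= 0; i--)
        if (!busy[i])
            return i;
    return -1;  /* every core is busy: the batch must wait */
}
```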
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as scientific programs, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed by Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered there: its reduction of mode switches, via the memory pages shared between user space and kernel space, was found useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it removed neither the requirement of data copying nor the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. without modifying application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and then recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: at installation time, generated at startup, or through manual work by the installer? On a related note, scalability with increased core counts is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be useful to know whether those hardware threads (2 per core) would be treated as cores when turned on. Would the scheduler then realize that it can use eight cores, and would the predefined static core list need to be modified to list eight instead of four?&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls the way exception-less system calls do. The calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
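A looped multi-call can be sketched as a chain of call descriptors executed strictly in order, with each result feeding the next argument. The descriptor layout and the stand-in operations below are invented for illustration; real multi-calls marshal actual syscall numbers and argument lists.&lt;br /&gt;

```c
#include <assert.h>

typedef long (*op_fn)(long);

/* One entry of a Cassyopia-style looped multi-call. */
struct mcall { op_fn op; };

/* Stand-ins for real system calls: e.g. an open followed by a read
 * that consumes the returned descriptor. */
static long open_like(long x) { return x + 1; }
static long read_like(long x) { return x * 10; }

/* Execute the whole chain in what would be a single kernel crossing.
 * Note the strictly sequential structure: entry i must complete before
 * entry i+1 starts, which is exactly where multi-calls differ from
 * exception-less calls. */
static long run_multicall(const struct mcall *calls, int n, long seed)
{
    long v = seed;
    for (int i = 0; i < n; i++)
        v = calls[i].op(v);
    return v;
}
```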
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the handling of device interrupts: both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC. It proposes processor modifications that allow hardware migration of threads to specialized cores. However, its authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel, or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A., &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and M. Parkes, &amp;lt;i&amp;gt;Using Cohort Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and P. Druschel, &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3, 2000, pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and G. Banga, &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., Wells, P. M., and Sohi, G. S., &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498 HTML]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6040</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6040"/>
		<updated>2010-12-02T00:44:28Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Critique: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services for several reasons (one being security); hence system calls act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
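On Linux, for example, the same request can be made through the C library wrapper or by raw syscall number; both paths cross from user space into the kernel. A Linux-specific sketch:&lt;br /&gt;

```c
/* Needed for syscall() on glibc. */
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue getpid by raw syscall number: one explicit user->kernel->user
 * crossing, the same trap the libc getpid() wrapper performs internally. */
static long getpid_raw(void)
{
    return syscall(SYS_getpid);
}
```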
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from user mode to kernel mode, or from kernel mode back to user mode. The direction does not matter; it is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
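Mode switch time can be estimated by timing a trivial system call in a loop: since the call itself does almost no work, the cost is dominated by the user/kernel round trip. A rough Linux-specific sketch:&lt;br /&gt;

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost of one synchronous system call, in nanoseconds: each
 * loop iteration pays a full user->kernel->user mode switch. */
static double ns_per_syscall(int iterations)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* trivial call: cost is mostly the switch */
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    return ns / iterations;
}
```

Note this captures only the direct cost; the indirect cost (cache and TLB pollution felt by the application afterwards) is what the paper argues is often larger.&lt;br /&gt;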
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to the wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
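Spatial locality is easy to demonstrate with a 2D array in C: traversing it row by row touches adjacent addresses, while traversing column by column strides across memory, even though both orders compute the same result. A minimal sketch:&lt;br /&gt;

```c
#include <assert.h>

#define DIM 256
static int grid[DIM][DIM];

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * so each cache line fetched is fully used (good spatial locality). */
static long sum_row_major(void)
{
    long s = 0;
    for (int r = 0; r < DIM; r++)
        for (int c = 0; c < DIM; c++)
            s += grid[r][c];
    return s;
}

/* Column-major traversal: each access jumps DIM * sizeof(int) bytes
 * ahead (poor spatial locality), even though the result is identical. */
static long sum_col_major(void)
{
    long s = 0;
    for (int c = 0; c < DIM; c++)
        for (int r = 0; r < DIM; r++)
            s += grid[r][c];
    return s;
}
```

On arrays much larger than the cache, the second loop typically runs measurably slower purely because of its access pattern.&lt;br /&gt;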
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
A lack of locality occurs when memory accesses are spread out in space and time, so that caches and the TLB cannot be exploited effectively. Mode switches contribute to this by overwriting processor state, reducing both the temporal and spatial locality of the interrupted application.[1][7][8]&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
Regular store instructions are ordinary writes to memory, as opposed to instructions (such as a syscall or software-interrupt instruction) that raise a processor exception. FlexSC issues requests by writing to the shared syscall pages with regular stores, which is why no mode switch is triggered at invocation time.[1]&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
A syscall page is a memory page shared between user mode and kernel mode, containing a table of syscall entries. User-mode threads write system call requests into free entries and later read back the return values, without triggering a processor exception.[1]&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
Syscall threads are kernel-mode threads whose sole purpose is to pull submitted requests from syscall pages and execute them, thereby decoupling system call execution from invocation.[1]&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
An inter-processor interrupt (IPI) is an interrupt that one core sends to another to make it take some action, such as migrating a thread. IPIs are expensive, which is why FlexSC avoids relying on them to offload system calls.[1]&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
The producer-consumer problem is the classic synchronization problem in which producers place items into a shared buffer and consumers remove them. In FlexSC, user-mode threads act as producers of system call requests and syscall threads act as consumers, with the syscall pages serving as the shared buffer.[1][2]&lt;br /&gt;
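The relationship here: user-mode threads produce system call requests and syscall threads consume them, with the syscall pages acting as the shared buffer. A single-threaded ring-buffer sketch of that pattern (real code would need synchronization between the two sides):&lt;br /&gt;

```c
#include <assert.h>

#define RING 4
static int ring[RING];
static int head, tail, count;

/* Producer (user-mode side): enqueue one request. */
static int produce(int request)
{
    if (count == RING) return -1;   /* buffer full: producer must wait */
    ring[tail] = request;
    tail = (tail + 1) % RING;
    count++;
    return 0;
}

/* Consumer (syscall-thread side): dequeue the oldest request. */
static int consume(int *request)
{
    if (count == 0) return -1;      /* buffer empty: consumer must wait */
    *request = ring[head];
    head = (head + 1) % RING;
    count--;
    return 0;
}
```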
&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls neither make use of parallel execution of system calls nor manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. Each batch can then be executed at one time, minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first queueing as many system call requests as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, system call execution is decoupled from system call invocation. This is described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;: a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; the system call then runs once the batch condition is met, and its return value is stored on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call yet have the execution of that system call completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
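The Free/Submitted/Done entry lifecycle described in points 1-4 can be pictured with a small C sketch. This is purely illustrative: the type and function names (`syscall_entry`, `flexsc_post`, `flexsc_collect`) are invented here and are not FlexSC&#039;s actual interface, and real shared-memory code would also need memory barriers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of one entry in a shared syscall page.
 * States follow the paper's description: Free -> Submitted -> Done. */
enum entry_status { FREE, SUBMITTED, DONE };

struct syscall_entry {
    enum entry_status status;
    long syscall_number;     /* which system call to run */
    long args[6];            /* Linux syscalls take up to 6 arguments */
    long return_value;       /* filled in by the kernel-side thread */
};

/* User side: claim a free entry and post a request.  Note this is a
 * plain store into shared memory, not a trap into the kernel. */
static struct syscall_entry *flexsc_post(struct syscall_entry *page,
                                         size_t n, long nr, long arg0)
{
    for (size_t i = 0; i < n; i++) {
        if (page[i].status == FREE) {
            page[i].syscall_number = nr;
            page[i].args[0] = arg0;
            page[i].status = SUBMITTED;  /* no processor exception raised */
            return &page[i];
        }
    }
    return NULL;  /* page full: the caller would block or wait */
}

/* User side: harvest a completed entry and free it for reuse. */
static long flexsc_collect(struct syscall_entry *e)
{
    long rv = e->return_value;
    e->status = FREE;
    return rv;
}
```

Between `flexsc_post` and `flexsc_collect`, a kernel-side syscall thread would notice the `SUBMITTED` entry, run the call, store the result, and flip the state to `DONE` - all without the user thread taking an exception.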
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to simply continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was handling multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
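The hybrid suggested here could be as simple as a threshold check at dispatch time. The sketch below is purely illustrative - the threshold value and all names are invented, not anything proposed in the paper:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative policy: below the threshold the batching machinery costs
 * more than a direct trap, so issue the call synchronously; above it,
 * post the call to a syscall page for batched execution. */
#define BATCH_THRESHOLD 4

enum dispatch { DISPATCH_SYNCHRONOUS, DISPATCH_EXCEPTION_LESS };

static enum dispatch choose_dispatch(size_t pending_syscalls)
{
    return pending_syscalls >= BATCH_THRESHOLD
               ? DISPATCH_EXCEPTION_LESS
               : DISPATCH_SYNCHRONOUS;
}
```

A real policy would likely tune the threshold per workload, since the break-even point depends on how much mode-switch cost batching actually amortizes.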
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one. This can cause a problem for system calls such as reads and writes, where the write call has an outstanding dependency on the read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU, acting primarily in user space, such as &#039;scientific programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited to data-intensive, IO-centric applications, as observed in the work of Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered there. It was found that FlexSC&#039;s reduction of mode switches, via the use of memory pages shared between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it did not remove the requirement of data copying and did not reduce the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add/modify the referenced lines and recompile the kernel, and repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how the dynamic allocation is done: it is mentioned that decisions are made based on the workload requirements, which doesn&#039;t exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do any manual work? On a related note, scalability with increased cores is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread; thus, as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on. I.e., would the scheduler then realize that it can use eight cores? Would the predefined static core list then need to be modified, to list eight instead of four?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same lines, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a mechanism in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors.[11] The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[6] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
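The strictly sequential nature of a multi-call, as contrasted above with exception-less calls, can be sketched as a single loop that is entered once and runs each call in order. The struct and function names below are invented for illustration; this is not the Cassyopia or Xen interface:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* One entry in a hypothetical multi-call batch. */
struct multicall_entry {
    long nr;        /* system call number, e.g. SYS_getpid */
    long args[3];
    long result;
};

/* A multi-call crosses into the kernel once and then runs each call
 * strictly in order: entry i must finish before entry i+1 starts.
 * (Here we simulate that with the glibc syscall(2) wrapper.) */
static void multicall(struct multicall_entry *calls, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        calls[i].result = syscall(calls[i].nr, calls[i].args[0],
                                  calls[i].args[1], calls[i].args[2]);
    }
}
```

In FlexSC, by contrast, the entries of a batch may be serviced in parallel by several syscall threads, and a blocked entry does not stall the ones behind it.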
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution by handling device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and they cannot, as FlexSC can, dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Teller, Patricia J., &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, IEEE Computer, Volume 23, Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]&lt;br /&gt;
&lt;br /&gt;
[20] Blake, M. Brian, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;, Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6033</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6033"/>
		<updated>2010-12-02T00:35:36Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Not Done Yet */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence system calls act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
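As a minimal concrete example, the snippet below crosses this gateway explicitly by invoking the `write` system call through glibc&#039;s standard `syscall(2)` wrapper (ordinary Linux API; nothing here is FlexSC-specific):

```c
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Cross the user/kernel gateway explicitly: ask the kernel to copy
 * `len` bytes from our user-space buffer to file descriptor `fd`. */
static long kernel_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling `kernel_write(1, "hi\n", 3)` traps into the kernel, which performs the write on our behalf and reports back how many bytes it wrote - exactly the messenger role described above.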
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode switches&amp;lt;/b&amp;gt; refer to transitions from one mode of execution to another: specifically, moving from user mode to kernel mode, or from kernel mode to user mode. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
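Linux already exposes a limited asynchronous interface for file IO through the standard POSIX AIO API (`&lt;aio.h&gt;`); the sketch below issues an `aio_read` and regains control immediately, which is the non-blocking behaviour described above. This uses the stock POSIX API, not FlexSC:

```c
#include <aio.h>      /* POSIX asynchronous IO: aio_read, aio_error, aio_return */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Start an asynchronous read and wait for it to finish.  aio_read()
 * returns as soon as the request is enqueued, so the calling thread
 * keeps control while the kernel services the IO. */
static ssize_t async_read(int fd, char *buf, size_t len)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)            /* enqueue; does not block */
        return -1;

    while (aio_error(&cb) == EINPROGRESS)
        ;                               /* a real program would do useful
                                           work here instead of spinning */

    return aio_return(&cb);             /* completion result, like read(2) */
}
```

The window between `aio_read` returning and the request completing is where an asynchronous caller gets its extra work done; compare the syscall-page mechanism discussed elsewhere in this essay.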
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System call pollution&amp;lt;/b&amp;gt; refers to the wasteful or unnecessary delay caused by system calls. This pollution stems from the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
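Spatial locality is easy to observe in C, where 2-D arrays are stored row-major: traversing row by row touches adjacent addresses, while traversing column by column jumps a whole row each step. Both loops below compute the same sum; only their access pattern, and therefore their cache behaviour, differs:

```c
#include <assert.h>

#define N 64  /* small matrix so the example stays quick */

/* Row-major traversal: consecutive iterations touch adjacent memory
 * addresses, which exploits spatial locality in the cache. */
static long sum_rows(int m[N][N])
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: each step jumps N ints ahead, defeating
 * spatial locality (same result, typically many more cache misses). */
static long sum_cols(int m[N][N])
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On large matrices the row-major version is usually noticeably faster, even though the two functions are arithmetically identical; the repeated sweeps over the same data also illustrate temporal locality.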
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I did some of these, some I don&#039;t think I can adequately explain, or just have no idea what they are, so I left them. --[[User:CFaibish|CFaibish]] 00:31, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
===Translation Look-Aside Buffer (TLB)===&lt;br /&gt;
A TLB is a table used in a virtual memory system that lists the physical address page number associated with each virtual address page number. A TLB is used in conjunction with a cache whose tags are based on virtual addresses. The virtual address is presented simultaneously to the TLB and to the cache so that cache access and the virtual-to-physical address translation can proceed in parallel. If the requested address is not cached then the physical address is used to locate the data in main memory. &lt;br /&gt;
&lt;br /&gt;
The TLB is the reason context switches can have such large performance penalties. Every time the OS switches context, the entire buffer is flushed. When the process resumes, it must be rebuilt from scratch. Too many context switches will therefore cause an increase in cache misses and degrade performance.[17]&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux Application Binary Interface (ABI)===&lt;br /&gt;
The ABI is a patch to the kernel that allows you to run SCO, Xenix, Solaris ix86 binaries on Linux.[18]&lt;br /&gt;
&lt;br /&gt;
===Native POSIX Thread Library (NPTL)===&lt;br /&gt;
NPTL is a software component that allows the Linux kernel to run applications optimized for POSIX Thread efficiency.[19]&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
Latency is a measure of the time delay between the start of an action and its completion in a system.[20]&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) achieved by the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between kernel and user the way FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides to synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucially, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost (pollution of critical processor structures) associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC provides the ability to designate a single core to run all system calls. This is possible because, for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page; the request will then be executed once the batch condition is met, and the return value will be stored on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
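The four mechanisms above can be illustrated with a minimal sketch. This is purely a toy model in Python, not the paper's kernel implementation, and every name in it (SyscallEntry, submit, syscall_thread, collect) is invented for illustration: entries on a shared page move through Free, Submitted, and Done states, with a kernel-side "syscall thread" draining submitted entries independently of the user thread that filed them.

```python
# Toy model of a FlexSC-style syscall page. All names are invented; the
# real mechanism is shared memory between user and kernel mode.
FREE, SUBMITTED, DONE = range(3)

class SyscallEntry:
    def __init__(self):
        self.state = FREE
        self.number = None      # which system call to run
        self.args = None
        self.ret = None

def submit(page, number, args):
    """User side: claim a free entry and mark it submitted (no exception raised)."""
    for entry in page:
        if entry.state == FREE:
            entry.number, entry.args = number, args
            entry.state = SUBMITTED
            return entry
    return None  # page full; the caller would wait or use another page

def syscall_thread(page, handlers):
    """Kernel side: execute every submitted entry, decoupled from invocation."""
    for entry in page:
        if entry.state == SUBMITTED:
            entry.ret = handlers[entry.number](*entry.args)
            entry.state = DONE

def collect(entry):
    """User side: retrieve the return value once the kernel marks the entry done."""
    if entry.state == DONE:
        ret, entry.state = entry.ret, FREE   # recycle the entry
        return ret
    return None  # still pending; the user thread keeps running meanwhile
```

Note how submission and collection are plain memory operations, which is exactly why no processor exception is generated on either side.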
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
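The hybrid idea suggested above can be made concrete with a back-of-the-envelope sketch. This is our own illustration, not anything from the paper; the threshold and batch size are invented parameters. It estimates how many user/kernel mode switches a workload would pay under such a policy: below the threshold, each call pays its own switch; above it, one switch covers a whole batch.

```python
def mode_switches(n_calls, threshold=8, batch_size=8):
    """Estimated mode switches under a hypothetical hybrid policy:
    synchronous (one switch per call) below the threshold,
    batched exception-less calls (one switch per batch) above it."""
    if n_calls > threshold:
        # ceil(n_calls / batch_size) batches, one switch each
        return -(-n_calls // batch_size)
    return n_calls
```

With these invented numbers, 3 calls cost 3 switches either way, while 64 calls would cost only 8 batched switches instead of 64 synchronous ones, which is where the amortization pays off.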
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC would jump to the next system call and execute that one instead. This can cause a problem for system calls such as reads and writes, where a write call has an outstanding dependency on a read call. This could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
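The "combined system call" idea mentioned above can be sketched as follows. This is a hypothetical illustration (the function names read_then_write, read_fn and write_fn are invented, and FlexSC itself provides no such primitive): a dependent read-then-write pair is submitted as a single unit, so an asynchronous executor cannot reorder the two halves.

```python
def read_then_write(read_fn, write_fn, src, dst, nbytes):
    """Hypothetical combined call: run a dependent pair as one request.
    The write consumes the read's result, so ordering is preserved by
    construction rather than by dependency tracking."""
    data = read_fn(src, nbytes)      # must finish first
    return write_fn(dst, data)       # depends on the data from the read
```

From the submitter's point of view this is one call, so the executor is free to schedule it like any other independent unit of work.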
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this simplest one proved the most efficient for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
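The core-selection policy described above is simple enough to sketch in a few lines. This is an illustrative stand-in (the busy list and function name are invented, not FlexSC's data structures): scan from the highest-numbered reserved core downward and take the first one that is free.

```python
def pick_syscall_core(busy, top_core):
    """Return the first non-busy core scanning top_core, top_core-1, ..., 0;
    None if every reserved core is occupied. 'busy' is an invented stand-in
    for real per-core state."""
    core = top_core
    while core >= 0:
        if not busy[core]:
            return core
        core -= 1
    return None
```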
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as &#039;scientific programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===IO === &lt;br /&gt;
FlexSC is not suited for data-intensive, IO-centric applications, as observed in the work of Vijay Vasudevan [16], whose research aims to reduce the energy footprint of data centers. FlexSC was considered there, and it was found that FlexSC&#039;s reduction of mode switches, via the use of shared memory pages between user space and kernel space, is useful for reducing the impact of system calls. That technique, however, was not useful for IO-intensive work, since it did not remove the requirement of data copying and did not reduce the overheads associated with interrupts in IO-intensive tasks.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Some Kernel Changes Are Required===&lt;br /&gt;
Though most of the work is done transparently, i.e. there is no need to modify application code, a small kernel change (3 lines of code) is still required, as per section 3.2 of the paper [1].&amp;lt;br&amp;gt;&lt;br /&gt;
That means adopters would have to add or modify the referenced lines and recompile the kernel, and would have to repeat this after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Multicore Systems ===&lt;br /&gt;
For a multicore system, the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running system call threads. It is unclear how this dynamic allocation is done; it is mentioned that decisions are made based on workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for system call thread assignments. It is unclear when that list is created: at installation time, generated initially, or does the installer have to do any manual work? On a related note, scalability with an increasing number of cores is ambiguous; it is not clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a system call thread, so as many threads as there are cores could be running concurrently, for one or more processes [1]. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable; however, it would be useful to know whether these hardware threads (2 per core) would actually be treated as cores when turned on. Would the scheduler then realize that it can use eight cores? Does that also mean the predefined static core list would need to be modified to list eight instead of four?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
Along the same reasoning, and given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance outcome when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs. Would FlexSC&#039;s scheduler be able to take advantage of the additional cores, and hence use them for specialized purposes?&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
A multi-call is a technique in which multiple system calls are collected and submitted as a single system call. It is used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, an additional mechanism whereby the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the context of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly interprocessor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[17] Patricia J. Teller &amp;lt;i&amp;gt;Translation-Lookaside Buffer Consistency&amp;lt;/i&amp;gt;, Journal Volume 23 Issue 6, IBM T. J. Watson Research Center, Yorktown Heights, NY, June 1990. [http://dx.doi.org/10.1109/2.55498]&lt;br /&gt;
&lt;br /&gt;
[18] Linux ABI sourceforge page. [http://linux-abi.sourceforge.net/] and Linux application page. [http://www.linux.org/apps/AppId_8088.html]&lt;br /&gt;
&lt;br /&gt;
[19] DREPPER, U., AND MOLNAR , I. &amp;lt;i&amp;gt;The Native POSIX Thread Library for Linux&amp;lt;/i&amp;gt;. Tech. rep., RedHat Inc, 2003. [http://people.redhat.com/drepper/nptl-design.pdf]&lt;br /&gt;
&lt;br /&gt;
[20] M. Brian Blake, &amp;lt;i&amp;gt;Coordinating Multiple Agents for Workflow-Oriented Process Orchestration&amp;lt;/i&amp;gt;. Information Systems and e-Business Management Journal, Springer-Verlag, December 2003. [http://www.cs.georgetown.edu/~blakeb/pubs/blake_ISEB2003.pdf]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6006</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6006"/>
		<updated>2010-12-01T23:51:11Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on the specifics of the essay.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;system call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence, system calls are the messengers between user space and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;mode switch&amp;lt;/b&amp;gt; refers to moving from one execution mode to another, specifically from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user-mode, perform the kernel-mode execution of the system call, and finally return execution back to user-mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;synchronous execution model (system call interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
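The contrast between the two definitions above can be shown with a toy Python sketch using threads (pure illustration, not a real kernel interface; the names sync_call, async_call, and poll are invented): the synchronous version returns only when the work is done, while the asynchronous version returns immediately and hands back a handle the caller polls later.

```python
import threading

def sync_call(work):
    return work()                       # caller is blocked until work completes

def async_call(work):
    """Start the work on a helper thread and return a poll function at once,
    so the caller keeps running while the work executes."""
    result = {}
    def runner():
        result["value"] = work()
    t = threading.Thread(target=runner)
    t.start()                           # control returns to the caller immediately
    def poll():
        if t.is_alive():
            return None                 # still executing; caller keeps running
        return result["value"]
    return poll
```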
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System call pollution&amp;lt;/b&amp;gt; refers to the wasteful or unnecessary delay in the system caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; involved takes the form of data over-written in critical processor structures like the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
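The definition above can be made concrete with a minimal sketch (our own illustration; the Batcher class is invented and nothing here is the real kernel mechanism): calls are queued instead of executed immediately, and flushing the queue stands in for a single switch to kernel mode that executes the whole batch.

```python
class Batcher:
    """Toy illustration of system call batching: defer calls, run them together."""
    def __init__(self):
        self.queue = []

    def call(self, fn, *args):
        self.queue.append((fn, args))   # deferred, not executed yet

    def flush(self):
        """One 'mode switch' executes every queued call and empties the batch."""
        results = [fn(*args) for fn, args in self.queue]
        self.queue = []
        return results
```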
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better. [2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to use for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although they are easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, the state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost (pollution of critical processor structures) associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the request is then executed once the batch condition is met, and the return value is stored back on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
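The entry life cycle described in points 3 and 4 above amounts to a tiny state machine. The sketch below is purely illustrative (state and actor names are invented; the paper's real mechanism lives in shared kernel memory): the user submits a Free entry, the kernel-side syscall thread completes a Submitted one, and the user frees a Done one.

```python
def advance(state, actor):
    """Legal transitions for one syscall entry in the toy model:
    user submits a Free entry; the kernel-side syscall thread completes
    a Submitted one; the user frees a Done one after reading the result."""
    transitions = {
        ("free", "user"): "submitted",
        ("submitted", "kernel"): "done",
        ("done", "user"): "free",
    }
    return transitions.get((state, actor), state)  # illegal moves are no-ops
```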
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as a server must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
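The hybrid policy suggested above can be sketched as a simple dispatch rule; the threshold value and function names here are invented for illustration:&lt;br /&gt;

```python
def dispatch(pending_calls, threshold=4):
    # Hypothetical hybrid policy: issue calls synchronously while
    # load is light, and switch to exception-less batching once
    # enough calls are pending to amortize the batching overhead.
    if len(pending_calls) >= threshold:
        return "batch (exception-less)"
    return "synchronous"

print(dispatch(["read"] * 2))
print(dispatch(["read"] * 8))
```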
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it will try core X-1, and so on. Of all the algorithms tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user-space, such as in &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===Further Observations===&lt;br /&gt;
&amp;lt;b&amp;gt;Transparency:&amp;lt;/b&amp;gt; Most of the work is done transparently, i.e. there is no need for application code modification. This benefit does not, however, extend to I/O-intensive tasks.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Kernel modification:&amp;lt;/b&amp;gt; FlexSC requires only a small kernel change (3 lines of code, as per section 3.2). That means adopters would have to add/modify the referenced lines and then recompile the kernel after each kernel update.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Core allocation:&amp;lt;/b&amp;gt; For a multicore system (section 3, mostly 3.5), the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic allocation is done; the paper states that decisions are made based on the workload requirements, which does not exactly clarify the mechanism. Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignments, but it is unclear when that list is created: is it at installation time, is it generated initially, or does the installer have to do any manual work?&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Scaling with increased cores is ambiguous:&amp;lt;/b&amp;gt; It is not entirely clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a syscall thread, so as many syscall threads as there are cores could run concurrently, including for the same process; more explicit results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether hardware threads (2 per core) would be treated as cores when hyper-threading is turned on: would the scheduler then realize that it can use eight cores, and would the predefined static core list have to be modified?&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;No testing with GPUs:&amp;lt;/b&amp;gt; Given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance effects of using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors.[11] The Cassyopia compiler has a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[6] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. On the other hand, exception-less system calls can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
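The looped multi-call idea can be sketched as follows. This is an illustrative Python analogue, not Cassyopia&#039;s actual interface: the result of each call in the batch is fed as the argument of the next.&lt;br /&gt;

```python
def multicall(calls, seed):
    # Execute a chain of calls sequentially as one batched unit:
    # each call receives the previous call's result. A real
    # multi-call would run the whole chain during a single
    # user/kernel mode switch.
    value = seed
    for func in calls:
        value = func(value)
    return value

# Plain functions stand in for system calls.
log = []
def fake_open(name):
    log.append("open")
    return 3                       # pretend file descriptor
def fake_read(fd):
    log.append("read from fd %d" % fd)
    return fd
def fake_close(fd):
    log.append("close fd %d" % fd)
    return 0

status = multicall([fake_read, fake_close], fake_open("a.txt"))
print(status, log)
```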
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the context of device interrupts: both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC. It proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Another class of solutions differs from FlexSC in two ways: it requires a micro-kernel, and unlike FlexSC it cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P. &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and M. Parkes, &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and P. Druschel, &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3, 2000, pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and G. Banga, &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., P. M. Wells and G. S. Sohi, &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. &amp;lt;i&amp;gt;Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes&amp;lt;/i&amp;gt;, Thesis Proposal, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010.[http://www.cs.cmu.edu/~vrv/proposal/vijay_thesis_proposal.pdf PDF]&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6004</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6004"/>
		<updated>2010-12-01T23:48:17Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
To fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to comprehend the paper.&lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services for several reasons (one being security); hence, system calls are the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
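For example, even writing a line to the console goes through a system call. Python&#039;s standard os module exposes thin wrappers over these kernel entry points:&lt;br /&gt;

```python
import os

# os.write is a thin wrapper around the write() system call:
# user space asks the kernel to copy bytes to file descriptor 1
# (standard output); the kernel performs the actual device I/O
# and reports how many bytes it wrote.
n = os.write(1, b"hello via a system call\n")
print(n)
```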
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving between processor modes: specifically, from user mode to kernel mode or from kernel mode back to user mode. The direction of the switch does not matter; this is a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to a structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls have mostly been synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
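A small standard-library sketch of this non-blocking behaviour (ordinary kernel-supported non-blocking I/O, not FlexSC itself):&lt;br /&gt;

```python
import os

# Create a pipe and put the read end into non-blocking mode.
# A read with no data available then fails immediately instead
# of blocking the caller, so control returns at once.
r, w = os.pipe()
os.set_blocking(r, False)

try:
    os.read(r, 16)
    outcome = "data"
except BlockingIOError:
    outcome = "would block"        # returned immediately

os.write(w, b"hi")
data = os.read(r, 16)              # now data is available
os.close(r); os.close(w)
print(outcome, data)
```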
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more formal way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
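A toy model of the idea, with plain function calls standing in for system calls and a single flush() standing in for the one switch into kernel mode:&lt;br /&gt;

```python
class SyscallBatch:
    # Collect requests instead of executing each one immediately;
    # a single flush() then executes the whole group, modelling
    # one mode switch for the batch instead of one per call.
    def __init__(self):
        self.pending = []
        self.mode_switches = 0

    def request(self, func, *args):
        self.pending.append((func, args))

    def flush(self):
        self.mode_switches += 1    # one switch for the whole batch
        results = [func(*args) for func, args in self.pending]
        self.pending = []
        return results

batch = SyscallBatch()
batch.request(len, "hello")
batch.request(abs, -3)
batch.request(max, 2, 7)
results = batch.flush()
print(results, batch.mode_switches)
```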
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
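Spatial locality can be demonstrated with a traversal-order experiment. This is illustrative only: in CPython the cache effect is muted compared to compiled code, and the timings vary by machine.&lt;br /&gt;

```python
import time

# Row-by-row traversal of a 2-D array touches adjacent memory,
# exploiting spatial locality; column-by-column traversal of the
# same data jumps between rows on every access.
n = 700
grid = [[1] * n for _ in range(n)]

def row_major():
    total = 0
    for i in range(n):
        for j in range(n):
            total += grid[i][j]
    return total

def col_major():
    total = 0
    for j in range(n):
        for i in range(n):
            total += grid[i][j]
    return total

t0 = time.perf_counter(); s1 = row_major(); t1 = time.perf_counter()
s2 = col_major(); t2 = time.perf_counter()
print(s1 == s2, "row %.3fs, col %.3fs" % (t1 - t0, t2 - t1))
```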
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===TLB===&lt;br /&gt;
The &amp;lt;b&amp;gt;translation look-aside buffer (TLB)&amp;lt;/b&amp;gt; is a small hardware cache of recently used virtual-to-physical address translations; it reduces the frequency of main memory accesses for page table entries.[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality===&lt;br /&gt;
When execution exhibits neither spatial nor temporal locality, structures such as the caches and the TLB must be refilled frequently, which degrades performance (see &amp;lt;b&amp;gt;Temporal and Spatial Locality&amp;lt;/b&amp;gt; above).[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Throughput===&lt;br /&gt;
&amp;lt;b&amp;gt;Throughput&amp;lt;/b&amp;gt; is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions===&lt;br /&gt;
&amp;lt;b&amp;gt;Regular store instructions&amp;lt;/b&amp;gt; are ordinary writes to memory. FlexSC communicates system call requests through regular loads and stores on shared memory instead of raising processor exceptions.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Linux ABI===&lt;br /&gt;
The &amp;lt;b&amp;gt;Linux ABI&amp;lt;/b&amp;gt; (application binary interface) is the binary-level contract between compiled applications and the Linux kernel, covering details such as system call numbers and calling conventions.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===NPTL===&lt;br /&gt;
&amp;lt;b&amp;gt;NPTL&amp;lt;/b&amp;gt; (Native POSIX Thread Library) is the default threading library on modern Linux systems.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Page===&lt;br /&gt;
A &amp;lt;b&amp;gt;syscall page&amp;lt;/b&amp;gt; is a memory page shared between user mode and kernel mode that is used to submit system call requests and retrieve their results; see the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section for details.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads===&lt;br /&gt;
&amp;lt;b&amp;gt;Syscall threads&amp;lt;/b&amp;gt; are kernel-mode threads that pull requests from syscall pages and execute them; see the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section for details.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt===&lt;br /&gt;
An &amp;lt;b&amp;gt;inter-processor interrupt&amp;lt;/b&amp;gt; is an interrupt sent from one core to another, for example to force it to run a particular task; such interrupts are relatively expensive.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Latency===&lt;br /&gt;
&amp;lt;b&amp;gt;Latency&amp;lt;/b&amp;gt; is the time elapsed between issuing a request and receiving its response.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem===&lt;br /&gt;
In the &amp;lt;b&amp;gt;producer-consumer problem&amp;lt;/b&amp;gt;, producers place work items into a shared buffer while consumers remove and process them. In FlexSC, user-mode threads act as producers of system call requests on the syscall pages, and syscall threads act as the consumers that execute them.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel or shared between kernel and user the way FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions; FlexSC, in contrast, provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple independently issued system calls, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls into batches. These batches can then be executed at one time, minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the request is executed once the batch condition is met, and the return value is stored back on the syscall page, where the user-mode thread can later retrieve it. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. An entry may be in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
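The syscall page protocol of points 3 and 4 can be modelled with a small state machine. This is a toy sketch: plain Python functions stand in for system calls, and the field names are invented.&lt;br /&gt;

```python
FREE, SUBMITTED, DONE = "free", "submitted", "done"

# A "syscall page" as a table of entries cycling through the
# three states described in the text.
page = [{"state": FREE, "call": None, "result": None}
        for _ in range(4)]

def submit(func, *args):
    # User mode: claim a free entry and mark it submitted.
    for entry in page:
        if entry["state"] == FREE:
            entry.update(state=SUBMITTED, call=(func, args))
            return entry
    raise RuntimeError("syscall page full")

def kernel_pass():
    # Kernel mode (syscall thread): execute every submitted
    # entry, store the return value, and mark the entry done.
    for entry in page:
        if entry["state"] == SUBMITTED:
            func, args = entry["call"]
            entry.update(state=DONE, result=func(*args))

e1 = submit(len, "hello")
e2 = submit(abs, -9)
kernel_pass()
print(e1["result"], e2["result"], e1["state"])
```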
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. To accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
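The M-on-N behaviour of FlexSC threads can be caricatured with generators standing in for user-mode threads. This is only a sketch of the scheduling idea; FlexSC&#039;s real runtime multiplexes native threads.&lt;br /&gt;

```python
# When a user-level task issues a "system call", the scheduler
# suspends it and runs another ready task instead of blocking,
# then resumes it with the result once the (simulated) kernel
# side has executed the batch. All names here are invented.
trace = []

def task(name, func, arg):
    trace.append(name + " submits")
    result = yield (func, arg)     # suspend at the "syscall"
    trace.append("%s resumes with %r" % (name, result))

def scheduler(tasks):
    pending = []
    for t in tasks:
        pending.append((t, t.send(None)))  # run to first syscall
    for t, (func, arg) in pending:         # "kernel" runs batch
        try:
            t.send(func(arg))              # resume with result
        except StopIteration:
            pass

scheduler([task("A", len, "hi"), task("B", abs, -4)])
print(trace)
```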
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as a server must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this, the simplest one, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
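The fallback policy described above (wake core X first, then scan downward) can be sketched in a few lines. This is an illustration of the described behaviour, not FlexSC's actual scheduler code.

```python
# A minimal sketch of the fallback policy described above: wake the
# highest-numbered core first, and on each new batch scan downward until an
# idle core is found. Illustrative only; not FlexSC's scheduler.

def pick_core(busy, num_cores):
    """Return the highest-numbered idle core, or None if all are busy."""
    for core in range(num_cores, 0, -1):  # X, X-1, ..., 1
        if core not in busy:
            return core
    return None

busy = set()
first = pick_core(busy, 4)   # core 4 (i.e. "core X") is chosen first
busy.add(first)
second = pick_core(busy, 4)  # core 4 busy, so core 3 is chosen next
```

The appeal of this policy is that busy syscall cores cluster at the high-numbered end, so user threads keep the low-numbered cores.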
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user-space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===Not Done Yet===&lt;br /&gt;
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; START &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Good&amp;lt;br&amp;gt;&lt;br /&gt;
Most of the work is done transparently, i.e. there is no need for application code modification.&amp;lt;br&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
Not suited for I/O-intensive tasks&amp;lt;br&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Small kernel change (3 lines of code as per section 3.2). That means adopters, after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a multicore system (section 3, mostly 3.5), the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignments. It is unclear when that list is created: is it built at installation time, is it generated at startup, or does the installer have to do manual work?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Scaling with increased cores is ambiguous&amp;lt;br&amp;gt;&lt;br /&gt;
Further, it is not entirely clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a syscall thread; thus as many threads as there are cores could be running concurrently, including for the same process. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would actually be treated as cores when turned on. Would the scheduler then realize that it can use eight cores? Would the predefined static core list have to be modified?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
No testing with GPUs&amp;lt;br&amp;gt;&lt;br /&gt;
Given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance effects when using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; END &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
A multi-call is a technique that involves collecting multiple system calls and submitting them as a single system call. It is used both in operating systems and in paravirtualized hypervisors.[11] The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[6] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
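The looped multi-call idea above (one submission, with one call's result feeding the next) can be made concrete with a toy interpreter. The `RESULT` placeholder and the mini-interpreter below are invented for illustration; this is not Cassyopia's actual mechanism or interface.

```python
# A toy sketch of the "looped multi-call" idea described above: several calls
# are submitted as one unit, and the result of one call can feed the next.
# Purely illustrative; not Cassyopia's real interface.

RESULT = object()  # placeholder meaning "result of the previous call"

def multi_call(calls):
    """Execute (function, args) pairs sequentially as one submission,
    substituting RESULT with the previous call's return value."""
    prev = None
    for func, args in calls:
        args = [prev if a is RESULT else a for a in args]
        prev = func(*args)
    return prev

# Example: "open" a file descriptor, then feed it to a "read"-like call.
fake_open = lambda path: 3                  # pretend fd returned by open
fake_read = lambda fd, n: f"fd{fd}:{n}b"    # pretend read result
out = multi_call([(fake_open, ["/tmp/x"]),
                  (fake_read, [RESULT, 16])])
```

Note that the calls still run strictly in order, which is exactly the sequential-execution limitation the paragraph above contrasts with exception-less calls.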
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, such as Soft Timers[13] and Lazy Receiver Processing[14], tackle the issue of locality of execution in the context of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other related solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system call mechanisms have decoupled the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes. THESIS PROPOSAL, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 12, 2010, P 19.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6003</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=6003"/>
		<updated>2010-12-01T23:46:30Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between User Space and Kernel Space. User Space is not given direct access to the Kernel&#039;s services for several reasons (one being security); hence, system calls act as the messengers between User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
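To see the gateway in action, the short sketch below uses Python's `os` module on a POSIX system; these functions are thin wrappers over the underlying system calls, so each line crosses the user/kernel boundary described above.

```python
# Every one of these calls crosses the user/kernel boundary described above:
# the os module functions are thin wrappers around real system calls.
# Assumes a POSIX system where /dev/null exists.
import os

pid = os.getpid()                        # getpid(2): ask the kernel for our PID
fd = os.open("/dev/null", os.O_WRONLY)   # open(2): kernel allocates a descriptor
os.write(fd, b"hello")                   # write(2): kernel performs the I/O
os.close(fd)                             # close(2): kernel releases the descriptor
```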
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one processor mode to another, specifically from User Space mode to Kernel mode or from Kernel mode back to User Space. It does not matter which direction or which modes we are switching from; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user-mode, perform the kernel-mode execution of the system call, and finally return execution to user-mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
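The two definitions above can be contrasted with a small simulation: a blocking call versus a call handed to a background thread, with control returning immediately. This is a generic illustration of the concepts, not FlexSC's interface; `slow_syscall` is a made-up stand-in for kernel work.

```python
# Contrast of the two definitions above: a synchronous call blocks the caller
# until the result is ready, while an asynchronous call returns control
# immediately and delivers the result later. Generic illustration only.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_syscall(x):
    time.sleep(0.05)   # stand-in for kernel work
    return x * 2

# Synchronous: the caller waits here until the "kernel work" finishes.
sync_result = slow_syscall(21)

# Asynchronous: invocation returns at once; the caller keeps running and
# collects the result only when it actually needs it.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_syscall, 21)
    # ... the caller can do other user-space work here ...
    async_result = future.result()  # rendezvous with the completed call
```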
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to the wasteful or unnecessary delay caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
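The payoff of the batching idea defined above is fewer user/kernel boundary crossings. The sketch below just counts mode switches under each policy; the function names are invented for illustration.

```python
# A sketch of the batching idea defined above: instead of paying one
# user/kernel mode switch per call, queue the calls and pay a single switch
# per flushed batch. Counts are illustrative; the names are made up.

def unbatched_switches(calls):
    return len(calls)          # one mode switch per immediate system call

def batched_switches(calls, batch_size):
    # One switch each time a full batch is flushed, plus one for leftovers.
    full, rest = divmod(len(calls), batch_size)
    return full + (1 if rest else 0)

calls = ["read"] * 32
assert unbatched_switches(calls) == 32   # 32 boundary crossings
assert batched_switches(calls, 8) == 4   # 8x fewer boundary crossings
```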
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
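Both forms of locality can be demonstrated with a toy cache model. The line size, cache capacity, and access patterns below are all invented for the sketch; real caches are far more elaborate.

```python
# Toy illustration of the locality definitions above: a tiny cache that
# stores fixed-size lines with least-recently-used eviction. Sequential
# access benefits from spatial locality; re-accessing the same addresses
# benefits from temporal locality. Didactic sketch only.

LINE = 4  # addresses per cache line

def hit_rate(addresses, cache_lines=8):
    cache, hits = [], 0
    for a in addresses:
        line = a // LINE
        if line in cache:
            hits += 1
            cache.remove(line)      # move to most-recently-used position
        elif len(cache) == cache_lines:
            cache.pop(0)            # evict the least recently used line
        cache.append(line)
    return hits / len(addresses)

sequential = list(range(32))              # spatial: neighbours share lines
repeated   = [0, 1, 2, 3] * 8             # temporal: same data re-used
scattered  = [i * 64 for i in range(32)]  # no locality: every access misses
```

Running `hit_rate` on the three patterns shows high hit rates for the first two and zero for the scattered one.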
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
An indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to adopt for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC handles both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel or shared between kernel and user execution the way FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. FlexSC, in contrast, provides a mechanism that separates system call execution from system call invocation; this is the key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides to synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is invoked, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make requests (system calls) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the request will then be executed once the batch condition is met, and its return value stored back on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution is completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
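Points 3 and 4 above can be made concrete with a small simulation: a shared table of entries cycling through Free, Submitted, and Done, with a separate "syscall thread" draining submitted entries while the user side carries on. The class, field names, and table layout are invented for illustration; they are not FlexSC's actual in-kernel structures.

```python
# A toy model of the syscall page (point 3) and syscall thread (point 4):
# entries move Free -> Submitted -> Done, and a separate worker thread
# executes submitted requests on the caller's behalf. Structure and names
# are invented for illustration; not FlexSC's real code.
import threading

FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallPage:
    def __init__(self, entries=4):
        self.entries = [{"state": FREE, "call": None, "ret": None}
                        for _ in range(entries)]
        self.lock = threading.Lock()

    def submit(self, func, *args):
        """User side: claim a free entry and mark it submitted. No exception
        is raised into the kernel; we only write shared memory."""
        with self.lock:
            for i, e in enumerate(self.entries):
                if e["state"] == FREE:
                    e.update(state=SUBMITTED, call=(func, args))
                    return i
        raise RuntimeError("syscall page full")

    def drain(self):
        """Kernel side (syscall thread): execute every submitted entry."""
        with self.lock:
            for e in self.entries:
                if e["state"] == SUBMITTED:
                    func, args = e["call"]
                    e.update(state=DONE, ret=func(*args))

    def collect(self, i):
        """User side: fetch the return value and free the entry."""
        with self.lock:
            e = self.entries[i]
            ret = e["ret"]
            e.update(state=FREE, call=None, ret=None)
            return ret

page = SyscallPage()
slot = page.submit(len, "hello")              # invocation: just a memory write
worker = threading.Thread(target=page.drain)  # execution, decoupled from above
worker.start(); worker.join()
result = page.collect(slot)                   # rendezvous with the result
```

Because `drain` could just as well run on a different core, the same model also shows why execution can be delegated to a dedicated syscall core.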
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this light, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers exhibit a great deal of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for dependent system calls, such as a write call with an outstanding dependency on a preceding read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently support such a mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in, if core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this, the simplest one, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user-space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
===Not Done Yet===&lt;br /&gt;
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; START &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Good&amp;lt;br&amp;gt;&lt;br /&gt;
Most of the work is done transparently, i.e. there is no need for application code modification.&amp;lt;br&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
Not suited for I/O-intensive tasks&amp;lt;br&amp;gt; &lt;br /&gt;
&lt;br /&gt;
Small kernel change (3 lines of code as per section 3.2). That means adopters, after each update of the kernel, would have to add/modify the referenced lines and then recompile the kernel.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a multicore system (section 3, mostly 3.5), the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic allocation is done. It is mentioned that decisions are made based on workload requirements, which does not exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignments. It is unclear when that list is created: is it built at installation time, is it generated at startup, or does the installer have to do manual work?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Scaling with increased cores is ambiguous&amp;lt;br&amp;gt;&lt;br /&gt;
Further, it is not entirely clear how scalable the scheduler is. One gets the impression that it is very scalable, since each core spawns a syscall thread; thus as many threads as there are cores could be running concurrently, including for the same process. More explicit results, however, would have been beneficial. The paper also mentions that hyper-threading was turned off to ease the analysis of the results. That is understandable, but it would be nice to know whether these hardware threads (2 per core) would actually be treated as cores when turned on. Would the scheduler then realize that it can use eight cores? Would the predefined static core list have to be modified?&amp;lt;br&amp;gt;  &lt;br /&gt;
&amp;lt;br&amp;gt; &lt;br /&gt;
No testing with GPU&amp;lt;br&amp;gt;&lt;br /&gt;
Given the growing popularity of GPU&#039;s use for general programming, it would&#039;ve 	been useful to at-least hypothesize on the possible performance&amp;lt;br&amp;gt;&lt;br /&gt;
effects when using specialized GPUs, like NVIDIA&#039;s Tesla GPUs for example.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; END &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
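The looped multi-call idea described above can be sketched in a few lines: the result of each call is threaded into the next, and the whole chain is submitted as one unit. The chaining function and the example calls below are illustrative stand-ins, not Cassyopia's actual interface.

```python
def multi_call(calls, initial):
    # Execute the batched calls sequentially, feeding each result into
    # the next call -- a toy model of a looped multi-call.
    result = initial
    for call in calls:
        result = call(result)
    return result

# e.g. a chain where each step consumes the previous step's return value
chain = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
assert multi_call(chain, 5) == 9   # ((5 + 1) * 2) - 3
```

Note that the chain runs strictly in order, matching the paper's point that multi-calls are sequential rather than parallel.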
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts. Both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike FlexSC, they cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between these proposals for non-blocking execution and FlexSC is that none of the non-blocking system call designs decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;br /&gt;
&lt;br /&gt;
[16] Vasudevan, Vijay. Improving Datacenter Energy Efficiency Using a Fast Array of Wimpy Nodes. THESIS PROPOSAL, October 12, 2010, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5881</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5881"/>
		<updated>2010-12-01T02:26:55Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Critique: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details on the specifics of the essay.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services, for several reasons (one being security), hence System calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
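As a concrete illustration, Python's os module exposes thin wrappers over several classic system calls; the sketch below (assuming a POSIX system) shows a process asking the kernel for services it cannot access directly.

```python
import os

# os.getpid, os.pipe, os.write and os.read are thin wrappers around the
# corresponding system calls: the process cannot touch kernel data
# structures itself, so each call traps into the kernel and back.
pid = os.getpid()                      # getpid() syscall: ask the kernel who we are
fd_read, fd_write = os.pipe()          # pipe() syscall: kernel allocates the buffers
os.write(fd_write, b"hello kernel\n")  # write() syscall: data is copied kernel-side
data = os.read(fd_read, 64)            # read() syscall: data comes back out
os.close(fd_read)
os.close(fd_write)
assert data == b"hello kernel\n"
```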
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from User Space mode to Kernel mode or from Kernel mode to User Space. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user-mode, perform the kernel-mode execution of the system call, and finally return execution back to user-mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
&amp;lt;b&amp;gt;Synchronous Execution Model(System call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls specifically are managed in a serialized manner. Moreover, the synchronous model completes one system call at a time, and does not move onto the next system call until the previous system call is finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous system calls.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
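The difference from the synchronous model can be sketched with Python's standard library; here a thread pool stands in for a truly asynchronous kernel interface, and the file name and helper are illustrative.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Asynchronous model in miniature: the call is handed off and control
# returns to the caller immediately; the result is collected later.
def blocking_read(path):
    with open(path, "rb") as f:
        return f.read()   # read() blocks inside the worker, not the caller

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload")
    path = f.name

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(blocking_read, path)  # returns at once
    # ... the caller is free to do other work here ...
    result = future.result()                   # rendezvous with the result

os.unlink(path)
assert result == b"payload"
```

The invocation (`submit`) and the completion (`result`) are separate events, which is the essence of the asynchronous, event-driven style mentioned above.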
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay in the system caused by system calls. This pollution follows directly from the fact that a system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
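On Unix-like systems, os.writev is a small real-world instance of batching: several buffers are submitted through one write-style system call instead of one call per buffer. A minimal sketch:

```python
import os

# System call batching in miniature: three buffers, one syscall.
buffers = [b"one ", b"two ", b"three"]
fd_read, fd_write = os.pipe()

written = os.writev(fd_write, buffers)  # a single writev() covers all three
data = os.read(fd_read, 64)

os.close(fd_read)
os.close(fd_write)
assert written == sum(len(b) for b in buffers)
assert data == b"one two three"
```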
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
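A toy illustration of the spatial-locality distinction follows; note that Python's object layout blunts the actual memory effect, so this shows the access pattern, not a measurable speedup.

```python
# Spatial locality in miniature: summing a 2-D grid row by row visits
# adjacent elements in storage order, while column by column jumps
# between rows on every step. Both walks compute the same total; on
# large arrays in lower-level languages the row-major walk is
# typically much faster because it reuses cached lines.
N = 500
grid = [[1] * N for _ in range(N)]

row_major = sum(grid[i][j] for i in range(N) for j in range(N))  # good locality
col_major = sum(grid[i][j] for j in range(N) for i in range(N))  # poor locality

assert row_major == col_major == N * N
```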
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC instead groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page; the call will then run once the batch condition is met, and its return value is stored on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution is completed on an entirely different core. This gives exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
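The submit/execute/collect cycle described in points 3 and 4 can be modelled with a toy "syscall page" and a worker thread standing in for a syscall thread. All names and structures here are illustrative, not FlexSC's actual layout.

```python
import threading
import queue
from enum import Enum

# Entries on the shared page move Free -> Submitted -> Done while a
# dedicated worker ("syscall thread") executes Submitted entries and the
# caller keeps running.

class State(Enum):
    FREE = 0
    SUBMITTED = 1
    DONE = 2

class Entry:
    def __init__(self):
        self.state = State.FREE
        self.call = None
        self.result = None

def syscall_thread(page, work):
    # Pull submitted entries and execute them, like a FlexSC syscall thread.
    while True:
        idx = work.get()
        if idx is None:
            return
        entry = page[idx]
        entry.result = entry.call()   # "kernel-side" execution
        entry.state = State.DONE

page = [Entry() for _ in range(4)]
work = queue.Queue()
worker = threading.Thread(target=syscall_thread, args=(page, work))
worker.start()

# User side: claim a free entry, submit the request, keep running.
entry = next(e for e in page if e.state is State.FREE)
entry.call = lambda: 6 * 7
entry.state = State.SUBMITTED
work.put(page.index(entry))

work.put(None)   # shut the worker down
worker.join()
assert entry.state is State.DONE and entry.result == 42
```

Because the caller never traps into the kernel to submit or collect, no processor exception is raised on either side of the exchange, which is the property the interface is named for.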
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
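The hybrid policy suggested here could be sketched as follows; the threshold value and the dispatch helpers are hypothetical, not part of FlexSC.

```python
THRESHOLD = 4   # made-up tuning knob: batch only when calls accumulate

def dispatch(pending, run_sync, run_batched):
    # Batch when enough calls have accumulated to amortize the mode
    # switches; otherwise take the cheap synchronous path.
    if len(pending) >= THRESHOLD:
        return run_batched(pending)
    return [run_sync(c) for c in pending]

results_few = dispatch([1, 2], lambda c: c * 10, lambda cs: [c * 10 for c in cs])
results_many = dispatch([1, 2, 3, 4, 5], lambda c: c * 10, lambda cs: [c * 10 for c in cs])
assert results_few == [10, 20]
assert results_many == [10, 20, 30, 40, 50]
```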
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
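The read-then-write dependency can be made concrete with a hypothetical "combined call" that preserves ordering, along the lines of what the paragraph suggests FlexSC currently lacks.

```python
# If the read and write were submitted as independent exception-less
# calls, the write could run before the read's result exists. A combined
# call runs both in order as one unit; the helper below is illustrative.
def combined_read_write(source, sink):
    data = source()   # must complete first ...
    sink(data)        # ... because the write consumes its result
    return data

buf = []
data = combined_read_write(lambda: b"record", buf.append)
assert buf == [b"record"] and data == b"record"
```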
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a syscall thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest, was the most efficient for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
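The search described above (try core X, then X-1, and so on until an idle core is found) can be sketched as a few lines; the core numbering and busy set are illustrative.

```python
def pick_syscall_core(num_cores, busy):
    # Walk downward from the highest-numbered core, returning the first
    # idle one, or None if every core is occupied.
    for core in range(num_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None

assert pick_syscall_core(4, busy=set()) == 3   # core 3 idle, chosen first
assert pick_syscall_core(4, busy={3}) == 2     # core 3 busy, fall back to 2
assert pick_syscall_core(2, busy={0, 1}) is None
```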
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user-space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
===Not Done Yet===&lt;br /&gt;
The following will be finalized on Wed. --[[User:Tafatah|Tafatah]] 02:26, 1 December 2010 (UTC)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; START &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Good&amp;lt;br&amp;gt;&lt;br /&gt;
Most of the work is done transparently, i.e. there is no need for application&amp;lt;br&amp;gt;&lt;br /&gt;
code modification.&amp;lt;br&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
Not for I/O-intensive tasks&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Small kernel change (3 lines of code, as per section 3.2). That means adopters&amp;lt;br&amp;gt;&lt;br /&gt;
would have to add/modify the referenced lines and then recompile the kernel&amp;lt;br&amp;gt;&lt;br /&gt;
after each kernel update.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a multicore system (section 3, mostly 3.5), the FlexSC scheduler will attempt to choose a subset of the available cores and specialize them for running syscall threads. It is unclear how the dynamic&amp;lt;br&amp;gt;&lt;br /&gt;
allocation is done. It is mentioned that decisions are made based on the workload requirements, which doesn&#039;t exactly clarify the mechanism.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Further, the paper mentions that a predefined, static list of cores is used for syscall thread assignments. It is unclear when that list is created. Is it built at installation time,&amp;lt;br&amp;gt;&lt;br /&gt;
is it generated initially, or does the installer have to do any manual work?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Scaling with increased cores is ambiguous&amp;lt;br&amp;gt;&lt;br /&gt;
Further, it is not clear how scalable the scheduler is. One gets the impression that it is very scalable because each core&amp;lt;br&amp;gt;&lt;br /&gt;
spawns a syscall thread; thus, as many threads as there are cores could be running concurrently, including for the same process. More explicit&amp;lt;br&amp;gt;&lt;br /&gt;
results, however, would have been beneficial. Further, the paper mentions that hyper-threading was turned off to ease the analysis of the results.&amp;lt;br&amp;gt;&lt;br /&gt;
That is understandable; however, it would be nice to know whether these hardware threads (2 per core) would be treated as cores when turned on. Would&amp;lt;br&amp;gt;&lt;br /&gt;
the scheduler then realize that it can use eight cores? Would the predefined static core list then have to be modified?&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
No testing with GPUs&amp;lt;br&amp;gt;&lt;br /&gt;
Given the growing popularity of GPUs for general-purpose programming, it would have been useful to at least hypothesize on the possible performance&amp;lt;br&amp;gt;&lt;br /&gt;
effects of using specialized GPUs, such as NVIDIA&#039;s Tesla GPUs.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;b&amp;gt; END &amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts. Both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike FlexSC, they cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P. &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and M. Parkes, &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and P. Druschel, &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3, 2000, pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and G. Banga, &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., P. M. Wells and G. S. Sohi, &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5790</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5790"/>
		<updated>2010-11-30T19:26:59Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Throughput */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;, written by Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf here] for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); hence, system calls act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving between processor modes: specifically, from user mode to kernel mode, or from kernel mode back to user mode. The term is general and does not imply a particular direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time, and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls have been mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution stems directly from the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
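As a concrete illustration of the two forms of locality (a generic example, not taken from the paper): iterating an array in order exhibits spatial locality, while repeatedly touching the same location exhibits temporal locality.&lt;br /&gt;

```python
# Illustrative only: both loops compute a sum, but their memory access
# patterns differ in the kind of locality they exhibit.
data = list(range(1000))

# Spatial locality: consecutive elements, adjacent in memory, are
# accessed one after another, so each cache line is fully used.
total = 0
for x in data:
    total += x

# Temporal locality: the same location (data[0]) is re-referenced
# again and again within a short window of time.
hot = 0
for _ in range(1000):
    hot += data[0]

print(total, hot)  # 499500 0
```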
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour; the higher n is, the better.[2, p. 151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that although they are easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. FlexSC, however, provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may make a system call request on a free entry of a syscall page; the request will then run once the batch condition is met, and its return value is stored on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
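The four mechanisms above can be tied together in a small user-level model. The following Python sketch is purely illustrative (FlexSC itself is implemented inside the Linux kernel, and none of these names are its real API): entries on a shared page move through the Free, Submitted, and Done states, a separate syscall thread executes submitted requests, and the submitting thread keeps running until it needs the result.&lt;br /&gt;

```python
# Hypothetical user-level model of the syscall page protocol described
# above (illustrative names, not FlexSC's actual kernel interface).
import threading

FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallEntry:
    def __init__(self):
        self.state = FREE
        self.call = None      # (name, args) requested by the user thread
        self.result = None    # return value filled in by the syscall thread

class SyscallPage:
    """A shared table of entries, polled by a kernel-side syscall thread."""
    def __init__(self, n_entries=8):
        self.entries = [SyscallEntry() for _ in range(n_entries)]

    def submit(self, name, args):
        # User side: claim a free entry and mark it submitted; no
        # processor exception is raised, it is just a memory write.
        for e in self.entries:
            if e.state == FREE:
                e.call = (name, args)
                e.state = SUBMITTED
                return e
        return None  # page full; a real system would batch or wait

def syscall_thread(page, table, stop):
    # Kernel side: pull submitted requests and execute them, decoupled
    # from the invoking thread (possibly on a different core).
    while not stop.is_set():
        for e in page.entries:
            if e.state == SUBMITTED:
                name, args = e.call
                e.result = table[name](*args)
                e.state = DONE

# Usage: the user thread submits, keeps running, then collects the result.
page = SyscallPage()
table = {"add": lambda a, b: a + b}  # stand-in for a real syscall table
stop = threading.Event()
t = threading.Thread(target=syscall_thread, args=(page, table, stop))
t.start()
entry = page.submit("add", (2, 3))
while entry.state != DONE:   # FlexSC would switch to another user thread here
    pass
stop.set(); t.join()
print(entry.result)  # 5
```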
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as &#039;scientific programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle the issue of locality of execution in the handling of device interrupts. Both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. A further solution has two differences from FlexSC: it requires a micro-kernel, and it cannot, as FlexSC can, dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all of these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] Barham, P. &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Xen and the Art of Virtualization&amp;lt;/i&amp;gt;, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] Larus, J. and M. Parkes, &amp;lt;i&amp;gt;Using Cohort-Scheduling to Enhance Server Performance&amp;lt;/i&amp;gt;, Proceedings of the USENIX Annual Technical Conference (ATEC), 2002, pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] Aron, M. and P. Druschel, &amp;lt;i&amp;gt;Soft Timers: Efficient Microsecond Software Timer Support for Network Processing&amp;lt;/i&amp;gt;, ACM Transactions on Computer Systems (TOCS) 18, 3, 2000, pp. 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] Druschel, P. and G. Banga, &amp;lt;i&amp;gt;Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems&amp;lt;/i&amp;gt;, Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), 1996, pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] Chakraborty, K., P. M. Wells and G. S. Sohi, &amp;lt;i&amp;gt;Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly&amp;lt;/i&amp;gt;, Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5789</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5789"/>
		<updated>2010-11-30T19:25:34Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Throughput */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;, written by Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf here] for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); system calls therefore act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; is a transition from one processor mode to another: specifically, from user mode to kernel mode, or from kernel mode back to user mode. The term is general and does not imply a particular direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time and does not move on to the next until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls have mostly been synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
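The contrast between the two models above can be sketched in a few lines. This is an illustrative analogy only (ordinary user-space threading, not a kernel mechanism): the synchronous call returns its value directly, while the asynchronous call returns a handle immediately and delivers the result later.

```python
import threading

def blocking_call(x):
    # Synchronous model: the caller waits until the result is ready.
    return x * 2

def async_call(x):
    # Asynchronous model: return a handle immediately; the result
    # arrives later, once the worker has finished.
    result = {}
    done = threading.Event()

    def worker():
        result["value"] = x * 2
        done.set()

    threading.Thread(target=worker).start()
    return result, done

# Synchronous: value available as soon as the call returns.
v = blocking_call(21)

# Asynchronous: the caller could do other work here before waiting.
res, done = async_call(21)
done.wait()
```

As in event-driven programming, the asynchronous caller must eventually check the handle for the result rather than relying on sequential completion.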
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
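The batching idea can be sketched with a toy model. This is illustrative only: the `Batcher` class and its `switches` counter are invented here, with the counter standing in for a user/kernel mode switch that is paid once per batch instead of once per call.

```python
class Batcher:
    def __init__(self):
        self.pending = []
        self.switches = 0

    def submit(self, fn, args):
        # Record the call instead of executing it immediately.
        self.pending.append((fn, args))

    def flush(self):
        # One simulated mode switch covers the whole batch.
        self.switches += 1
        results = [fn(*args) for fn, args in self.pending]
        self.pending.clear()
        return results

b = Batcher()
for i in range(4):
    b.submit(lambda x: x + 1, (i,))
out = b.flush()   # four calls, one "switch"
```

Four submissions cost a single flush, which is the source of both the direct savings (fewer switches) and the indirect savings (less pollution of processor structures).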
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor executes, on average, in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
Throughput is an indication of how much work is done during a unit of time, e.g. n transactions per hour. The higher n is, the better. [2 P151]&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching and the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may write a system call request into a free entry of a syscall page; the request will then be executed once the batch condition is met, and the return value is stored on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt;, meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt;, meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt;, meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
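The syscall-page and syscall-thread mechanics described in points 3 and 4 can be modeled as a small sketch. All names and fields here are illustrative, not FlexSC's actual interface: a shared table of entries cycles Free to Submitted to Done, and a separate routine stands in for the kernel-mode syscall thread that services submitted entries, possibly on another core.

```python
FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallEntry:
    def __init__(self):
        self.state = FREE
        self.number = None   # which "system call" is requested
        self.args = None
        self.ret = None

def user_submit(page, number, args):
    # User space writes a request into a free entry with plain
    # stores; no processor exception is raised.
    for entry in page:
        if entry.state == FREE:
            entry.number, entry.args = number, args
            entry.state = SUBMITTED
            return entry
    return None   # page full; the caller must wait for a free entry

def syscall_thread(page, table):
    # The "syscall thread" scans for submitted entries, executes
    # them, and posts the results back onto the page.
    for entry in page:
        if entry.state == SUBMITTED:
            entry.ret = table[entry.number](*entry.args)
            entry.state = DONE

page = [SyscallEntry() for _ in range(4)]
table = {0: lambda a, b: a + b}   # toy "system call" table
e = user_submit(page, 0, (2, 3))
syscall_thread(page, table)       # e.state is now DONE, e.ret == 5
```

Because submission and servicing are separate functions operating on shared state, the model also shows why execution can be delegated to a dedicated core while the requesting thread keeps running.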
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time it has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
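The hybrid policy suggested above can be sketched as a dispatch rule. The THRESHOLD value, the `dispatch` function, and the mode labels are invented here for illustration; they are not taken from the paper.

```python
THRESHOLD = 3

def dispatch(calls):
    # calls: list of (fn, args) pairs representing pending requests.
    if len(calls) >= THRESHOLD:
        # Enough pending work: take the batched, exception-less path,
        # amortizing the switch cost over the whole group.
        return "batched", [fn(*args) for fn, args in calls]
    # Few calls: the synchronous path carries less overhead.
    return "synchronous", [fn(*args) for fn, args in calls]

mode_small, _ = dispatch([(abs, (-1,))])
mode_large, vals = dispatch([(abs, (-i,)) for i in range(5)])
```

Either path produces the same results; only the (simulated) cost model differs, which is what makes a threshold-based hybrid plausible.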
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a syscall thread; when another batch comes in, if core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this, the simplest, turned out to be the most efficient for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a technique in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named the looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
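The looped multi-call described above can be sketched as a chain of calls submitted as one unit, with each call's result fed as the argument to the next. The chaining scheme here is a toy, not Cassyopia's actual mechanism.

```python
def multi_call(calls, seed):
    # Execute the whole chain in one "submission"; intermediate
    # results never cross back to the caller between steps.
    value = seed
    trace = []
    for fn in calls:
        value = fn(value)
        trace.append(value)
    return value, trace

# A stylized open -> read -> process pipeline, as arithmetic steps.
final, steps = multi_call([lambda x: x + 1, lambda x: x * 10], 4)
```

Note that the steps still run sequentially within the chain, which is exactly the limitation that distinguishes multi-calls from exception-less system calls.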
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution by handling device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike FlexSC, they cannot dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While all these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tal, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5743</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5743"/>
		<updated>2010-11-30T16:34:20Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Background Concepts: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services, for several reasons (one being security); system calls therefore act as the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; is a transition from one processor mode to another: specifically, from user mode to kernel mode, or from kernel mode back to user mode. The term is general and does not imply a particular direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time and does not move on to the next until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls have mostly been synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay caused by system calls. This pollution is a direct consequence of the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor executes, on average, in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Process Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
===Producer-Consumer Problem ===&lt;br /&gt;
Note: explain the relationship &lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can slow down the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily, and it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user the way FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation, a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first queuing as many system call requests as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, system call execution is decoupled from system call invocation, as described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode, through which user-space threads request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; once the batch condition is met, the kernel executes the call and stores the return value in the same entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, each of which is in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
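The entry life cycle described in points 3 and 4 (Free, Submitted, Done, with a batched kernel-side pass) can be sketched as a small state machine. This is a conceptual sketch only; the class, the dispatch table, and the `flush` step, which stands in for the single mode switch in which syscall threads drain the page, are all invented names, not the paper's actual interface.

```python
# Conceptual sketch of a syscall page: user code fills Free entries,
# a simulated kernel-side flush executes every Submitted entry in one
# batch, and results are later picked up from Done entries.

FREE, SUBMITTED, DONE = "free", "submitted", "done"

class SyscallPage:
    def __init__(self, n_entries=64):
        self.entries = [{"state": FREE, "call": None, "ret": None}
                        for _ in range(n_entries)]

    def submit(self, call, args):
        """User side: claim a Free entry. No mode switch happens here,
        so the calling thread can keep running."""
        for e in self.entries:
            if e["state"] == FREE:
                e.update(state=SUBMITTED, call=(call, args), ret=None)
                return e
        raise RuntimeError("syscall page full")

    def flush(self, table):
        """Kernel side: one batched pass over all Submitted entries,
        standing in for the single user-to-kernel mode switch."""
        for e in self.entries:
            if e["state"] == SUBMITTED:
                name, args = e["call"]
                e["ret"] = table[name](*args)
                e["state"] = DONE

# Hypothetical 'kernel' dispatch table for the sketch.
table = {"add": lambda a, b: a + b, "neg": lambda a: -a}

page = SyscallPage()
e1 = page.submit("add", (2, 3))   # user thread continues after this
e2 = page.submit("neg", (7,))
page.flush(table)                  # both calls execute in one batch
```

Note that neither `submit` nor reading `e1["ret"]` afterwards would trap into the kernel; only the batched `flush` corresponds to kernel-mode work.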
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache, and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead, we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution; in this situation FlexSC far outperformed traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution, using synchronous calls below some threshold and exception-less system calls above it, would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a great deal of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work that it does not need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if one needs to block, FlexSC will jump to the next system call and execute that one instead. This can cause a problem for system calls such as reads and writes, where a write call has an outstanding dependency on a read call. This could be resolved by some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not currently handle such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
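To make the dependency hazard concrete: if a read and a dependent write are submitted as independent asynchronous calls, the runtime is free to run the write before the read has produced any data. A combined call that forwards the read's result into the write removes the hazard by construction. The sketch below is purely illustrative; FlexSC provides no such interface, and the function names are invented.

```python
# A read followed by a dependent write, expressed as one combined call
# so the runtime never sees them as independent, reorderable work.

buf = {"src": b"hello", "dst": b""}   # stand-ins for two open files

def sys_read(key):
    """Illustrative stand-in for a read system call."""
    return buf[key]

def sys_write(key, data):
    """Illustrative stand-in for a write system call."""
    buf[key] = data
    return len(data)

def combined_read_write(src, dst):
    """Execute the read and the dependent write as a single unit,
    feeding the read's result straight into the write."""
    data = sys_read(src)
    return sys_write(dst, data)

n = combined_read_write("src", "dst")   # write sees the read's data
```

Submitted separately under an asynchronous model, `sys_write` could observe an empty buffer; as one combined submission, the ordering is internal and cannot be violated.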
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of the cores to system calls. Currently, FlexSC first wakes up core X to run a syscall thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms the authors tested, this, the simplest, turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
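The search order described above amounts to a linear scan downward from the highest-numbered core. A minimal sketch of that selection rule (function and variable names are illustrative, not FlexSC's):

```python
# FlexSC-style syscall-core selection as described in the text:
# start at the highest-numbered core and walk down until a core
# that is not already busy with syscall threads is found.

def pick_syscall_core(busy):
    """busy[i] is True if core i is already running syscall threads.
    Returns the chosen core index, or None if every core is busy."""
    for core in range(len(busy) - 1, -1, -1):   # X, X-1, ..., 0
        if not busy[core]:
            return core
    return None

# Four cores; core 3 is busy, so the next batch goes to core 2.
chosen = pick_syscall_core([False, False, False, True])
```

The appeal of this rule is that under a single application the same small set of high-numbered cores keeps absorbing syscall work, preserving locality on the remaining user-mode cores.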
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and runs primarily in user space, as in scientific programs, FlexSC causes more overhead than performance gain. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls the way exception-less system calls do. The system calls of a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one blocks, the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
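A looped multi-call in this sense can be sketched as a single submission whose steps run strictly in order, each step optionally consuming the previous step's result, in contrast to exception-less calls, which may run in parallel. This is a conceptual sketch; the function names and the descriptor value are invented for illustration.

```python
# Sketch of a looped multi-call: one submission, strictly sequential
# execution, with each step's result available to the next step.

def multi_call(steps):
    """steps is a list of functions; each receives the previous
    step's result (None for the first) and returns its own result."""
    result = None
    for step in steps:
        result = step(result)   # strictly sequential: no parallelism
    return result

# Hypothetical pipeline: an open-like call yields a descriptor,
# and a later call in the same multi-call consumes it.
r = multi_call([
    lambda _: 42,        # e.g. a descriptor returned by the first call
    lambda fd: fd * 2,   # a dependent call using that descriptor
])
```

The whole pipeline crosses the user/kernel boundary once, which is the batching benefit; the price is that step two can never start before step one finishes.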
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques, including Soft Timers[13] and Lazy Receiver Processing[14], tackle locality of execution in the context of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, it did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. These solutions also differ from FlexSC in two ways: they require a micro-kernel, and they cannot, as FlexSC can, dynamically adapt the proportion of cores used by the kernel versus cores shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between the many proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5742</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5742"/>
		<updated>2010-11-30T16:31:10Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Background Concepts: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services for several reasons (one being security); hence system calls are the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
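For instance, even asking "what is my process ID?" goes through this gateway: a library wrapper traps into the kernel, which services the request. The snippet below uses Python's os module, which wraps the corresponding Unix system calls; it is an illustration of the concept, not code from the paper.

```python
import os

# os.getpid() wraps the getpid(2) system call: user space asks the
# kernel for the process ID instead of reading kernel state directly.
pid = os.getpid()

# os.write() wraps write(2); the kernel performs the actual I/O.
# A zero-byte write to stdout is a valid, harmless system call that
# returns the number of bytes written (0 here).
n = os.write(1, b"")
```

Each of these calls performs the user-to-kernel mode switch discussed in the next section, which is precisely the cost FlexSC seeks to amortize.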
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
A &amp;lt;b&amp;gt;Mode Switch&amp;lt;/b&amp;gt; refers to moving between the two execution modes: from user mode to kernel mode, or from kernel mode back to user mode; the term is general and does not specify a direction. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, the time necessary to issue a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner: the synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay in the system caused by system calls. This pollution stems directly from the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data overwritten in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle an issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the tendency of a program, during execution, to access the same set of data repeatedly over a brief period of time. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality is the pattern that memory locations in close physical proximity tend to be referenced close together in time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
===Inter-Processor Interrupt ===&lt;br /&gt;
&lt;br /&gt;
===Latency ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls is the ease of programming that comes with sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy for application programmers to adopt.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily, and it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel, and although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user the way FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation, a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first queuing as many system call requests as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can designate a single core to run all system calls. This is possible because, for an exception-less system call, system call execution is decoupled from system call invocation, as described further in the &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode, through which user-space threads request kernel-mode procedures (system calls). A user-mode thread writes a system call request into a free entry of a syscall page; once the batch condition is met, the kernel executes the call and stores the return value in the same entry. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries, each of which is in one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows a user-mode thread to issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface as compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles roughly every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
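The linear search policy described above can be sketched as follows (illustrative Python with assumed semantics; the real FlexSC scheduler operates on kernel data structures): try the highest-numbered core first, then walk downward until a free one is found.

```python
def pick_syscall_core(busy, num_cores):
    """Return the highest-numbered idle core, or None if all are busy.

    busy: set of core numbers currently running a syscall thread.
    """
    # Walk from core X (= num_cores - 1) down to core 0.
    for core in range(num_cores - 1, -1, -1):
        if core not in busy:
            return core
    return None   # every candidate core is occupied

assert pick_syscall_core(set(), 4) == 3          # core X is tried first
assert pick_syscall_core({3}, 4) == 2            # X busy, fall back to X-1
assert pick_syscall_core({0, 1, 2, 3}, 4) is None
```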
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as in scientific programs, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially: each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threading, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between many of the proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple the system call invocation from its execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tal, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5738</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5738"/>
		<updated>2010-11-30T16:20:15Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Background Concepts: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf], for further details.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between user space and kernel space. User space is not given direct access to the kernel&#039;s services for several reasons (one being security); hence system calls are the messengers between user and kernel space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
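As a concrete illustration, Python's os module exposes thin wrappers around these kernel services; each call below crosses from user space into the kernel and back.

```python
import os

# pipe(2): ask the kernel for a kernel-managed buffer with two descriptors.
r, w = os.pipe()
os.write(w, b"hello")     # write(2): the kernel copies bytes into the pipe
data = os.read(r, 5)      # read(2): the kernel copies the bytes back out
os.close(r)               # close(2): release the kernel resources
os.close(w)
print(data)               # b'hello'
```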
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, moving from user mode to kernel mode or from kernel mode back to user mode. It does not matter which direction we are switching in; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event-driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
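The contrast with blocking calls can be seen with a standard POSIX non-blocking file descriptor (a Python sketch; note this only shows "return immediately instead of blocking", whereas exception-less calls as in FlexSC additionally let the kernel execute the call concurrently).

```python
import fcntl
import os

r, w = os.pipe()
# By default, read(2) on an empty pipe blocks the caller until data
# arrives.  Setting O_NONBLOCK makes the same call return immediately.
flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    os.read(r, 1)              # empty pipe: a blocking read would stall here
    outcome = "data"
except BlockingIOError:        # non-blocking read reports EAGAIN instead
    outcome = "would-block"

os.write(w, b"x")              # once data is present, the read succeeds
data = os.read(r, 1)
os.close(r)
os.close(w)
print(outcome, data)
```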
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; refers to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that a system call invokes a mode switch, which is not a costless operation. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation look-aside buffer, a table which reduces the frequency of main-memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop current execution unexpectedly in order to handle the issue. There are many situations which generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
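A minimal sketch of the idea (the class and method names here are illustrative, not an existing API): requests accumulate in user space and are executed together, so a real implementation would pay one mode switch per flush instead of one per call. The "kernel" is simulated by plain function calls.

```python
class BatchingShim:
    """Queue requests in user space; execute them together on flush()."""

    def __init__(self):
        self.pending = []          # queued (operation, args) requests
        self.mode_switches = 0     # user->kernel crossings a real shim pays

    def submit(self, op, *args):
        self.pending.append((op, args))   # no kernel entry yet

    def flush(self):
        self.mode_switches += 1           # one crossing for the whole batch
        results = [op(*args) for op, args in self.pending]
        self.pending.clear()
        return results

shim = BatchingShim()
for n in (1, 2, 3):
    shim.submit(pow, n, 2)         # stand-ins for system call requests
results = shim.flush()
print(results, shim.mode_switches)   # [1, 4, 9] 1
```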
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that during execution there will be a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality; &amp;lt;b&amp;gt; spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity will be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the amount of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
===Linux ABI ===&lt;br /&gt;
&lt;br /&gt;
===NPTL ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Page ===&lt;br /&gt;
&lt;br /&gt;
===Syscall Threads ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use comes with undesirable side effects which can lower the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while remaining easy for application programmers to use.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work on &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution on multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls; FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution on multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user, as FlexSC can.[1] Non-blocking execution research has focused on threading, event-based (non-blocking) and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation; this is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s alternative to synchronous system calls. The downsides of synchronous system calls include the cumulative mode switch time of multiple system calls each invoked independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is invoked, FlexSC groups system calls into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching reduces both the direct cost of mode switching and the indirect cost, the pollution of critical processor structures, associated with switching modes.[1] System call batching works by first requesting as many system calls as possible, then switching to kernel mode and executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user mode and kernel mode. User-space threads interact with syscall pages in order to request kernel-mode procedures (system calls). A user-mode thread may write a system call request into a free entry of a syscall page; the request is then executed once the batch condition is met, and the return value is stored on the syscall page. The user-mode thread can later return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor retrieving the return value from it generates a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute them, always in kernel mode. This is the mechanism that allows exception-less system calls to let a user-mode thread issue a request and continue to run while the kernel-level system call is being executed. In addition, since system call invocation is separate from execution, a process running on one core may request a system call whose execution completes on an entirely different core. This gives exception-less system calls the unique capability of delegating all system call execution to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
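The syscall-page entry life cycle from point 3 above can be sketched as a small state machine (Python; the field and method names are illustrative, not FlexSC's actual memory layout or API).

```python
from enum import Enum

class State(Enum):
    FREE = 0        # user thread may write a new request into this entry
    SUBMITTED = 1   # kernel-side syscall thread may pick it up and run it
    DONE = 2        # return value is ready for the user thread to read

class SyscallEntry:
    def __init__(self):
        self.state = State.FREE
        self.number = None   # which system call is requested
        self.args = None
        self.ret = None

    def submit(self, number, args):
        # User side: posting the request raises no processor exception.
        assert self.state is State.FREE
        self.number, self.args = number, args
        self.state = State.SUBMITTED

    def execute(self, table):
        # Kernel side: a syscall thread runs the requested operation.
        assert self.state is State.SUBMITTED
        self.ret = table[self.number](*self.args)
        self.state = State.DONE

    def collect(self):
        # User side: read the return value and free the entry for reuse.
        assert self.state is State.DONE
        ret, self.state = self.ret, State.FREE
        return ret

entry = SyscallEntry()
entry.submit("add", (2, 3))                  # stand-in for a real request
entry.execute({"add": lambda a, b: a + b})   # simulated kernel dispatch table
result = entry.collect()
print(result)   # 5
```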
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC threads can immediately run multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles roughly every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we run does not become more efficient along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above that threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores to system calls. Currently, FlexSC first wakes up core X to run a system call thread; when another batch comes in and core X is still busy, it tries core X-1, and so on. Of all the algorithms tested, this simplest one turned out to be the most efficient for FlexSC scheduling. However, it was only tested with FlexSC running a single application at a time; FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where a single thread uses 100% of a CPU and acts primarily in user space, such as in scientific programs, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for such cases.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a concept which involves collecting multiple system calls and submitting them as a single system call. They are used both in operating systems and paravirtualized hypervisors. The Cassyopia compiler has a special technique named a looped multi-call, in which the result of one system call can be fed as an argument to another system call in the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls. Multi-calls do not investigate parallel execution of system calls, nor do they address the blocking of system calls as exception-less system calls do. Multi-call system calls are executed sequentially: each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and in the presence of blocking the next call can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. Another technique, named Computation Spreading[15], is most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, the authors did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. Other solutions differ from FlexSC in two ways: they require a micro-kernel, and, unlike them, FlexSC can dynamically adapt the proportion of cores used by the kernel or shared between user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC could provide a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between these proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5733</id>
		<title>COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_3&amp;diff=5733"/>
		<updated>2010-11-30T16:13:18Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Background Concepts: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;3.FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Paper ==&lt;br /&gt;
The paper we will be analyzing is titled &amp;quot;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;quot;. Its authors are Livio Soares and Michael Stumm, both of whom are from the University of Toronto. The paper can be viewed here, [http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf] for further details on the specifics of the essay.&lt;br /&gt;
== Background Concepts: ==&lt;br /&gt;
&lt;br /&gt;
In order to fully understand the FlexSC paper, it is essential to understand the key concepts that are discussed within it. Listed below are the main concepts required to fully comprehend the paper. &lt;br /&gt;
&lt;br /&gt;
===System Call===&lt;br /&gt;
A &amp;lt;b&amp;gt;System Call&amp;lt;/b&amp;gt; is the gateway between the User Space and the Kernel Space. The User Space is not given direct access to the Kernel&#039;s services for several reasons (one being security); hence, system calls are the messengers between the User and Kernel Space.[1][4]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
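To make the idea concrete, here is a minimal sketch of invoking a system call directly, bypassing the usual libc wrapper. It assumes an x86-64 Linux machine, where the getpid system call number is 39; syscall numbers differ on other architectures.

```python
import ctypes
import os

# Invoke the getpid system call two ways: through the libc wrapper
# (os.getpid) and through the raw syscall(2) entry point.
# SYS_getpid = 39 is the x86-64 Linux syscall number (an assumption;
# the number differs on other architectures).
SYS_getpid = 39

libc = ctypes.CDLL(None, use_errno=True)
raw_pid = libc.syscall(SYS_getpid)

# Both paths cross from user mode into kernel mode and back.
assert raw_pid == os.getpid()
print("pid via raw syscall:", raw_pid)
```

Each such invocation traps into the kernel, which is exactly the mode switch that FlexSC tries to amortize.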
&lt;br /&gt;
===Mode Switch===&lt;br /&gt;
&amp;lt;b&amp;gt;Mode Switches&amp;lt;/b&amp;gt; refer to moving from one execution mode to another: specifically, from User Space mode to Kernel mode, or from Kernel mode back to User Space. It does not matter which direction or which modes we are switching between; this is simply a general term. Crucial to mode switching is the &amp;lt;b&amp;gt;mode switch time&amp;lt;/b&amp;gt;, which is the time necessary to execute a system call instruction in user mode, perform the kernel-mode execution of the system call, and finally return execution back to user mode.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Synchronous System Call===&lt;br /&gt;
The &amp;lt;b&amp;gt;Synchronous Execution Model (System Call Interface)&amp;lt;/b&amp;gt; refers to the structure in which system calls are managed in a serialized manner. The synchronous model completes one system call at a time and does not move on to the next system call until the previous one has finished executing. This form of system call is blocking, meaning the process which initiates the system call is blocked until the system call returns. Traditionally, operating system calls are mostly synchronous.[1][2]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Asynchronous System Call===&lt;br /&gt;
An &amp;lt;b&amp;gt;asynchronous system call&amp;lt;/b&amp;gt; is a system call which does not block upon invocation; control of execution is returned to the calling process immediately. Asynchronous system calls do not necessarily execute in order and can be compared to event driven programming.[2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Pollution===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Pollution&amp;lt;/b&amp;gt; is a more sophisticated way of referring to wasteful or unnecessary delay in the system caused by system calls. This pollution is in direct correlation with the fact that the system call invokes a mode switch, which is not a costless task. The &amp;quot;pollution&amp;quot; takes the form of data over-written in critical processor structures such as the TLB (translation lookaside buffer - a table which reduces the frequency of main memory accesses for page table entries), branch prediction tables, and the caches (L1, L2, L3).[1][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Processor Exceptions===&lt;br /&gt;
&amp;lt;b&amp;gt;Processor exceptions&amp;lt;/b&amp;gt; are situations which cause the processor to stop its current execution unexpectedly in order to handle the issue. Many situations generate processor exceptions, including undefined instructions and software interrupts (system calls).[5]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&amp;lt;b&amp;gt;System Call Batching&amp;lt;/b&amp;gt; is the concept of collecting system calls together to be executed in a group instead of executing them immediately after they are called.[6]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
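As a toy illustration (not FlexSC's actual interface), batching can be modeled as queueing requests in user space and executing the whole queue in one pass, which stands in for a single mode switch instead of one switch per call:

```python
# A toy model of system call batching: requests are recorded but not run,
# then "executed" together in one flush, standing in for a single
# user/kernel mode switch instead of one switch per call.
batch = []

def batch_call(func, *args):
    batch.append((func, args))        # record the request, don't run it yet

def flush():
    # One mode switch would happen here; then every queued call runs.
    results = [func(*args) for func, args in batch]
    batch.clear()
    return results

batch_call(len, "hello")
batch_call(sum, [1, 2, 3])
print(flush())  # -> [5, 6]
```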
&lt;br /&gt;
===Temporal and Spatial Locality===&lt;br /&gt;
Locality is the concept that, during execution, there is a tendency for the same set of data to be accessed repeatedly over a brief time period. There are two important forms of locality: &amp;lt;b&amp;gt;spatial locality&amp;lt;/b&amp;gt; and &amp;lt;b&amp;gt;temporal locality&amp;lt;/b&amp;gt;. Spatial locality refers to the pattern that memory locations in close physical proximity tend to be referenced close together in a short period of time. Temporal locality, on the other hand, is the tendency of recently requested memory locations to be requested again.[7][8]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Instructions Per Cycle (IPC)===&lt;br /&gt;
&amp;lt;b&amp;gt;Instructions per cycle&amp;lt;/b&amp;gt; is the number of instructions a processor can execute in a single clock cycle.[9]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
will add the following terms.&amp;lt;br&amp;gt;&lt;br /&gt;
TODO-Start --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
===TLB ===&lt;br /&gt;
&lt;br /&gt;
===Lack of Locality ===&lt;br /&gt;
&lt;br /&gt;
===Throughput ===&lt;br /&gt;
&lt;br /&gt;
===Regular Store Instructions ===&lt;br /&gt;
&lt;br /&gt;
TODO End --[[User:Tafatah|Tafatah]] 16:13, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Research Problem: ==&lt;br /&gt;
System calls provide an interface for user-mode applications to request services from the operating system. Traditionally, the system call interface has been implemented using synchronous system calls, which block the calling user-space process when the system call is initiated. The benefit of using synchronous system calls comes from the easy-to-program nature of sequential operation. However, this ease of use also comes with undesirable side effects which can reduce the instructions per cycle (IPC) of the processor.[9] In &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, Soares and Stumm attempt to provide a new form of system call which minimizes the negative effects of synchronous system calls while still remaining easy to implement for application programmers.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The negative effects of synchronous system calls have been researched heavily; it is accepted that, although easy to use, they are not optimal. Previous research includes work into &amp;lt;b&amp;gt;system call batching&amp;lt;/b&amp;gt; such as multi-calls[6], &amp;lt;b&amp;gt;locality of execution with multicore systems&amp;lt;/b&amp;gt;[7][8], and &amp;lt;b&amp;gt;non-blocking execution&amp;lt;/b&amp;gt;. System call batching shares great similarity with FlexSC, as multiple system calls are grouped together to reduce the number of mode switches required of the system.[6] The difference is that multi-calls do not make use of parallel execution of system calls, nor do they manage the blocking aspect of synchronous system calls. FlexSC describes methods to handle both of these situations, as described in the &amp;lt;b&amp;gt;Contribution&amp;lt;/b&amp;gt; section of this document.[1] Previous research into locality of execution and multicore systems has focused on managing device interrupts and limiting the processor interference associated with interrupt handling.[7][8] However, these solutions require a microkernel and, although they can dedicate certain execution to specific cores of a system, they cannot dynamically adapt the proportion of cores used by the kernel and the cores shared between the kernel and the user like FlexSC can.[1] Non-blocking execution research has focused on threaded, event-based (non-blocking), and hybrid solutions. However, FlexSC provides a mechanism to separate system call execution from system call invocation. This is a key difference between FlexSC and previous research.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Contribution: ==&lt;br /&gt;
&lt;br /&gt;
===Exception-Less System Calls===&lt;br /&gt;
Exception-less system calls are the research team&#039;s attempt to provide an alternative to synchronous systems calls. The downside to synchronous system calls includes the cumulative mode switch time of multiple system calls each called independently, state pollution of key processor structures (TLB, cache, etc.)[1][3], and, potentially the most crucial, the performance impact on the user-mode application during a system call. Exception-less system calls attempt to resolve these three issues through:&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
1. &amp;lt;u&amp;gt;System Call Batching:&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Instead of having each system call run as soon as it is called, FlexSC groups system calls together into batches. These batches can then be executed at one time, thus minimizing the frequency of mode switches between user and kernel modes. Batching provides a benefit both in terms of the direct cost of mode switching as well as the indirect cost, pollution of critical processor structures, associated with switching modes.[1] System call batching works by first collecting as many system call requests as possible, then switching to kernel mode, and then executing each of them.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
2. &amp;lt;u&amp;gt;Core Specialization&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
On a multi-core system, FlexSC can provide the ability to designate a single core to run all system calls. The reason this is possible is that for an exception-less system call, the system call execution is decoupled from the system call invocation. This is described further in &amp;lt;b&amp;gt;Decoupling Execution from Invocation&amp;lt;/b&amp;gt; section below.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
3. &amp;lt;u&amp;gt;Exception-less System Call Interface&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
To provide an asynchronous interface to the kernel, FlexSC uses &amp;lt;b&amp;gt;syscall pages&amp;lt;/b&amp;gt;. Syscall pages are a set of memory pages shared between user-mode and kernel-mode. User-space threads interact with syscall pages in order to make a request (system call) for kernel-mode procedures. A user-mode thread may make a system call request on a free entry of a syscall page, the syscall page will then run once the batch condition is met and store the return value on the syscall page. The user-mode thread can then return to the syscall page to obtain the return value. Neither issuing the system call via the syscall page nor getting the return value from the syscall page generate a processor exception. Each syscall page is a table of syscall entries. These entries may have one of three states: &amp;lt;b&amp;gt;Free&amp;lt;/b&amp;gt; - meaning a syscall can be added to the entry; &amp;lt;b&amp;gt;Submitted&amp;lt;/b&amp;gt; - meaning the kernel can proceed to invoke the appropriate system call operations; and &amp;lt;b&amp;gt;Done&amp;lt;/b&amp;gt; - meaning the kernel is finished and the return value is ready for the user-mode thread to retrieve it.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
4. &amp;lt;u&amp;gt;Decoupling Execution from Invocation&amp;lt;/u&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In order to separate a system call invocation from the execution of the system call, &amp;lt;b&amp;gt;syscall threads&amp;lt;/b&amp;gt; were created. The sole purpose of syscall threads is to pull requests from syscall pages and execute the request, always in kernel mode. This is the mechanic that allows exception-less system calls to provide the ability for a user-mode thread to issue a request and continue to run while the kernel level system call is being executed. In addition, since the system call invocation is separate from execution, a process running on one core may request a system call yet the execution of the system call may be completed on an entirely different core. This allows exception-less system calls the unique capability of having all system call execution delegated to a specific core while other cores maintain user-mode execution.[1]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
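The syscall-page and syscall-thread mechanism described above can be sketched in miniature. The names and data structures below are illustrative, not FlexSC's real API: entries move from free to submitted to done, and a separate worker thread (standing in for a kernel-side syscall thread) executes submitted entries while the caller keeps running.

```python
import threading
import queue

# Illustrative states for an entry on a shared "syscall page".
FREE, SUBMITTED, DONE = "free", "submitted", "done"

class Entry:
    def __init__(self):
        self.state, self.call, self.result = FREE, None, None

page = [Entry() for _ in range(8)]   # shared table of syscall entries
work = queue.Queue()

def syscall_thread():
    # Stands in for a kernel-side thread: pull submitted entries, run them.
    while True:
        entry = work.get()
        if entry is None:            # shutdown sentinel
            break
        entry.result = entry.call()  # execute the requested "system call"
        entry.state = DONE

def submit(func):
    # Invocation: claim a free entry, mark it submitted, return immediately.
    entry = next(e for e in page if e.state == FREE)
    entry.call, entry.state = func, SUBMITTED
    work.put(entry)                  # no processor exception is modeled
    return entry

worker = threading.Thread(target=syscall_thread)
worker.start()
e = submit(lambda: 2 + 2)            # returns without waiting for execution
# ... the caller could do other user-mode work here ...
while e.state != DONE:               # poll the entry for completion
    pass
work.put(None)
worker.join()
print(e.state, e.result)  # -> done 4
```

Because execution happens on a separate thread, the invocation could run on one core while the work completes on another, which is the property FlexSC exploits for core specialization.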
&lt;br /&gt;
===FlexSC Threads===&lt;br /&gt;
As mentioned above, FlexSC threads are a key component of the exception-less system call interface. FlexSC threads transform regular, synchronous system calls into exception-less system calls and are compatible with both the POSIX and default Linux thread libraries. This means that FlexSC Threads are immediately capable of running multi-threaded Linux applications with no modifications. The intended use of these threads is with server-type applications which contain many user-mode threads. In order to accommodate multiple user-mode threads, the FlexSC interface provides a syscall page for each core of a system. In this manner, multiple user-mode threads can be multiplexed onto a single syscall page, which in turn has a single kernel-level thread to facilitate execution of the system calls. Programming with FlexSC threads can be compared to event-driven programming, as interactions are not guaranteed to be sequential. This does increase the complexity of programming for an exception-less system call interface compared to the relatively simple synchronous system call interface.[1][2][3]&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Critique: ==&lt;br /&gt;
&lt;br /&gt;
===Moore&#039;s Law===&lt;br /&gt;
One interesting aspect of this paper is how the research relates to Moore&#039;s Law. Moore&#039;s Law states that the number of transistors on a chip doubles every 18 months.[10] This has led to very large increases in the performance potential of software, but at the same time has opened a large gap between the actual performance of efficient and inefficient software. This paper claims that the gap is mainly caused by the disparity in the cost of accessing different processor resources such as registers, cache and memory.[1] In this manner, the FlexSC interface is not just an attempt to increase the efficiency of current system calls; it is actually an attempt to change the way we view software. It is not enough to continue to build more powerful machines if the code we currently run will not speed up (become more efficient) along with the gain in power. Instead we need to focus on appropriate allocation and usage of that power, as failure to do so is the origin of the gap between our potential and our performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Performance of FlexSC===&lt;br /&gt;
It is of particular interest to note that exception-less system calls only outperformed synchronous system calls when the system was running multiple system calls. For an individual system call, the overhead of the FlexSC interface was greater than that of a synchronous call. The real benefit of FlexSC comes when there are many system calls which can in turn be batched before execution. In this situation the FlexSC system far outperformed the traditional synchronous system calls.[1] This is why the research paper&#039;s focus is on server-like applications, as servers must handle many user requests efficiently to be useful. Thus, for the general case it appears that a hybrid solution of synchronous calls below some threshold and exception-less system calls above the same threshold would be most efficient.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Blocking Calls===&lt;br /&gt;
FlexSC relies on the fact that web and database servers have a lot of concurrency and independent parallelism. FlexSC can &#039;harvest&#039; enough independent work so that it doesn&#039;t need to track dependencies between system calls. However, this could be a problem in other situations. Since FlexSC system calls are &#039;inherently asynchronous&#039;, if they need to block, FlexSC would jump to the next system call and execute that one. This can cause a problem for system calls such as reading and writing, where the write call has an outstanding dependency on the read call. However, this could be resolved by using some kind of combined system call, that is, multiple system calls executed as one single call. Unfortunately, FlexSC does not have any current handling for such an implementation.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Core Scheduling Issues===&lt;br /&gt;
In a system with X cores, FlexSC needs to dedicate some subset of cores for system calls. Currently, FlexSC first wakes up core X to run a system call thread, and when another batch comes in, if core X is still busy, it will then try core X-1, and so on. Of all the algorithms they tested, it turned out that this, the simplest algorithm, was the most efficient algorithm for FlexSC scheduling. However, this was only tested with FlexSC running a single application at a time. FlexSC&#039;s scheduling algorithm would need to be fine-tuned for running multiple applications.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===When There Are Not More Threads Than Cores===&lt;br /&gt;
In situations where there is a single thread using 100% of a CPU and acting primarily in user space, such as &#039;Scientific Programs&#039;, FlexSC causes more overhead than performance gains. As a result, FlexSC is not an optimal implementation for cases such as this.&lt;br /&gt;
&lt;br /&gt;
== Related Work: ==&lt;br /&gt;
&lt;br /&gt;
===System Call Batching===&lt;br /&gt;
&lt;br /&gt;
Multi-calls are a technique in which multiple system calls are collected and submitted as a single system call. They are used both in operating systems and in paravirtualized hypervisors. The Cassyopia compiler supports a special form named a looped multi-call, in which the result of one system call can be fed as an argument to another system call within the same multi-call.[11] There is a significant difference between multi-calls and exception-less system calls: multi-calls do not investigate parallel execution of system calls, nor do they address blocking system calls. The system calls in a multi-call are executed sequentially; each one must complete before the next may start. Exception-less system calls, on the other hand, can be executed in parallel, and when one call blocks, the next can execute immediately.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
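A looped multi-call can be sketched as a chain in which each call's result feeds the next argument; the helper below is illustrative, not Cassyopia's actual interface, and it highlights that execution is strictly sequential:

```python
def multi_call(start, *calls):
    # Run a chain of calls as one submitted unit: each call must complete
    # before the next begins, and each result feeds the next call's argument.
    value = start
    for call in calls:
        value = call(value)
    return value

# Two toy "system calls" chained in one looped multi-call.
result = multi_call(3, lambda x: x * 2, lambda x: x + 1)
print(result)  # -> 7
```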
&lt;br /&gt;
===Locality of Execution and Multicores===&lt;br /&gt;
&lt;br /&gt;
Several techniques have addressed the issue of locality of execution. Larus and Parkes proposed Cohort Scheduling to efficiently execute staged computations.[12] Other techniques include Soft Timers[13] and Lazy Receiver Processing[14], which tackle locality of execution in the handling of device interrupts; both try to limit the processor interference associated with interrupt handling without affecting the latency of servicing requests. A technique named Computation Spreading[15] is the most similar to the multicore execution of FlexSC: it proposes processor modifications that allow hardware migration of threads to specialized cores. However, that work did not model TLBs, and on current hardware synchronous thread migration requires a costly inter-processor interrupt. There are two further differences from FlexSC: these solutions require a micro-kernel, whereas FlexSC does not, and FlexSC can dynamically adapt the proportion of cores used exclusively by the kernel versus cores shared by user and kernel execution. While these solutions rely on expensive inter-processor interrupts to offload system calls, FlexSC provides a more efficient and flexible mechanism.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Non-blocking Execution===&lt;br /&gt;
&lt;br /&gt;
Past research on improving system call performance has focused extensively on blocking versus non-blocking behavior. Typically, researchers have used threaded, event-based (non-blocking), and hybrid systems to obtain high performance in server applications. The main difference between these proposals for non-blocking execution and FlexSC is that none of the non-blocking system calls decouple system call invocation from system call execution.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References: ==&lt;br /&gt;
[1] Soares, Livio and Michael Stumm, &amp;lt;i&amp;gt;FlexSC: Flexible System Call Scheduling with Exception-Less System Calls&amp;lt;/i&amp;gt;, University of Toronto, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Soares.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[2] Tanenbaum, Andrew S., &amp;lt;i&amp;gt;Modern Operating Systems: 3rd Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2008.&lt;br /&gt;
&lt;br /&gt;
[3] Stallings, William, &amp;lt;i&amp;gt;Operating Systems: Internals and Design Principles - 6th Edition&amp;lt;/i&amp;gt;, Pearson/Prentice Hall, New Jersey, 2009.&lt;br /&gt;
&lt;br /&gt;
[4] Garfinkel, Tim, &amp;lt;i&amp;gt;Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools&amp;lt;/i&amp;gt;, Computer Science Department - Stanford University.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2695&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[5] Yoo, Sunjoo &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Automatic Generation of Fast Timed Simulation Models for Operating Systems in SoC Design&amp;lt;/i&amp;gt;, SLS Group, TIMA Laboratory, Grenoble, 2002.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1148&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[6] Rajagopalan, Mohan &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Cassyopia: Compiler Assisted System Optimization&amp;lt;/i&amp;gt;, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, 2003.[https://www.usenix.org/events/hotos03/tech/full_papers/rajagopalan/rajagopalan.pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[7] Kumar, Sanjeev and Christopher Wilkerson, &amp;lt;i&amp;gt;Exploiting Spatial Locality in Data Caches using Spatial Footprints&amp;lt;/i&amp;gt;, Princeton University and Microcomputer Research Labs (Oregon), 1998.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1550&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[8] Jin, Shudong and Azer Bestavros, &amp;lt;i&amp;gt;Sources and Characteristics of Web Temporal Locality&amp;lt;/i&amp;gt;, Computer Science Department - Boston University, Boston. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.5941&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[9] Agarwal, Vikas &amp;lt;i&amp;gt;et al.&amp;lt;/i&amp;gt;, &amp;lt;i&amp;gt;Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures&amp;lt;/i&amp;gt;, University of Texas, Austin, 2000.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3694&amp;amp;rep=rep1&amp;amp;type=pdf PDF]&lt;br /&gt;
&lt;br /&gt;
[10] Tuomi, Ilkka, &amp;lt;i&amp;gt;The Lives and Death of Moore&#039;s Law&amp;lt;/i&amp;gt;, 2002.[http://131.193.153.231/www/issues/issue7_11/tuomi/ HTML]&lt;br /&gt;
&lt;br /&gt;
[11] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.&lt;br /&gt;
&lt;br /&gt;
[12] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.&lt;br /&gt;
&lt;br /&gt;
[13] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.&lt;br /&gt;
&lt;br /&gt;
[14] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.&lt;br /&gt;
&lt;br /&gt;
[15] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4958</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4958"/>
		<updated>2010-11-15T01:37:02Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is working on what ? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group 3 Essay=&lt;br /&gt;
&lt;br /&gt;
Hello everyone, please post your contact information here:&lt;br /&gt;
&lt;br /&gt;
Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Rey Arteaga: rarteaga@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
Corey Faibish: corey.faibish@gmail.com&lt;br /&gt;
&lt;br /&gt;
Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]&lt;br /&gt;
 &lt;br /&gt;
Can&#039;t access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it&#039;s not too bad. Rarteaga&lt;br /&gt;
&lt;br /&gt;
==Question 3 Group==&lt;br /&gt;
*Abdul-Fatah Tawfic tafatah&lt;br /&gt;
*Arteaga Reynaldo rarteaga&lt;br /&gt;
*Faibish Corey   cfaibish&lt;br /&gt;
*Lawrence Wesley wlawrenc&lt;br /&gt;
*Preston Mike    mpreston&lt;br /&gt;
*Robson  Benjamin brobson&lt;br /&gt;
*Sun     Fangchen sfangche&lt;br /&gt;
&lt;br /&gt;
==Who is working on what ?==&lt;br /&gt;
Just to keep track of who&#039;s doing what --[[User:Tafatah|Tafatah]] 01:37, 15 November 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4957</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4957"/>
		<updated>2010-11-15T01:36:44Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Question 3 Group */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group 3 Essay=&lt;br /&gt;
&lt;br /&gt;
Hello everyone, please post your contact information here:&lt;br /&gt;
&lt;br /&gt;
Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Rey Arteaga: rarteaga@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
Corey Faibish: corey.faibish@gmail.com&lt;br /&gt;
&lt;br /&gt;
Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]&lt;br /&gt;
 &lt;br /&gt;
Can&#039;t access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it&#039;s not too bad. Rarteaga&lt;br /&gt;
&lt;br /&gt;
==Question 3 Group==&lt;br /&gt;
*Abdul-Fatah Tawfic tafatah&lt;br /&gt;
*Arteaga Reynaldo rarteaga&lt;br /&gt;
*Faibish Corey   cfaibish&lt;br /&gt;
*Lawrence Wesley wlawrenc&lt;br /&gt;
*Preston Mike    mpreston&lt;br /&gt;
*Robson  Benjamin brobson&lt;br /&gt;
*Sun     Fangchen sfangche&lt;br /&gt;
&lt;br /&gt;
==Who is working on what ?==&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4956</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 3</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_3&amp;diff=4956"/>
		<updated>2010-11-15T01:36:08Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Group 3 Essay */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group 3 Essay=&lt;br /&gt;
&lt;br /&gt;
Hello everyone, please post your contact information here:&lt;br /&gt;
&lt;br /&gt;
Ben Robson [mailto:brobson@connect.carleton.ca brobson@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Rey Arteaga: rarteaga@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
Corey Faibish: corey.faibish@gmail.com&lt;br /&gt;
&lt;br /&gt;
Tawfic Abdul-Fatah: [mailto:tfatah@gmail.com tfatah@gmail.com]&lt;br /&gt;
 &lt;br /&gt;
Can&#039;t access the video without a login as we found out in class, but you can listen to the speech and follow with the slides pretty easily, I just went through it and it&#039;s not too bad. Rarteaga&lt;br /&gt;
&lt;br /&gt;
==Question 3 Group==&lt;br /&gt;
*Abdul-Fatah Tawfic tafatah&lt;br /&gt;
*Arteaga Reynaldo rarteaga&lt;br /&gt;
*Faibish Corey   cfaibish&lt;br /&gt;
*Lawrence Wesley wlawrenc&lt;br /&gt;
*Preston Mike    mpreston&lt;br /&gt;
*Robson  Benjamin brobson&lt;br /&gt;
*Sun     Fangchen sfangche&lt;br /&gt;
&lt;br /&gt;
Just to keep track of who&#039;s doing what --[[User:Tafatah|Tafatah]] 01:36, 15 November 2010 (UTC)&lt;br /&gt;
==Who is working on what ?==&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4625</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4625"/>
		<updated>2010-10-15T08:01:29Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2] &lt;br /&gt;
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in several ways; chief among them are modularity, storage virtualization, and the ability to self-repair. To understand how ZFS realizes these requirements, it is important to note that ZFS is made up of the following subsystems: the SPA (Storage Pool Allocator), DSL (Data Set and Snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attribute Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log), and ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it does not abstract the underlying physical storage enough: physical blocks are merely abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs for allocating and freeing blocks of storage, addressed by their DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all details of the storage are abstracted from the caller. Because a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will be a very long time before the address space becomes a limitation; even then, the SPA module could be replaced while leaving the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
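The malloc()/free() analogy above can be sketched with a toy allocator; the names here (ToySPA, alloc, free) are illustrative only and not the real SPA interface.&lt;br /&gt;

```python
# Toy sketch of the SPA's malloc()/free()-style block interface.
# The real SPA hands out DVAs (data virtual addresses) over pooled
# physical storage; here the pool is just a set of block numbers.

class ToySPA:
    def __init__(self, total_blocks):
        # all storage in the pool, abstracted as a flat set of DVAs
        self.free_dvas = set(range(total_blocks))
        self.allocated = set()

    def alloc(self):
        """Hand out one free block, like malloc() hands out memory."""
        dva = self.free_dvas.pop()
        self.allocated.add(dva)
        return dva

    def free(self, dva):
        """Return a block to the pool, like free()."""
        self.allocated.discard(dva)
        self.free_dvas.add(dva)

pool = ToySPA(total_blocks=8)
d = pool.alloc()
pool.free(d)
```

The caller never sees which physical device backs a given DVA, which is the abstraction the paragraph above describes.&lt;br /&gt;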
&lt;br /&gt;
Virtual devices abstract the underlying device drivers. A virtual device can be thought of as a node with possible children, where each child is either another virtual device or a device driver. The SPA also handles traditional volume-manager tasks such as mirroring, accomplishing each via a virtual device written for that specific task: if the SPA needs to handle mirroring, a virtual device is written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object is labelled with a 64-bit number and can hold up to 2^64 bytes of information, allowing for a very large storage limit.&lt;br /&gt;
 &lt;br /&gt;
ZFS also uses the idea of a dnode, a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, a ZFS file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. These levels of abstraction increase ZFS&#039;s flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU objects, allowing any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
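The block/object/object-set layering can be sketched in a few lines of Python; all names here (blocks, dnodes, read_object) are hypothetical illustrations, not ZFS code.&lt;br /&gt;

```python
# Sketch of the layering described above: blocks are grouped into
# objects via a dnode-like record, objects into object sets, and the
# file system is a collection of object sets.

blocks = {0: b"he", 1: b"llo", 2: b" wo", 3: b"rld"}

# a dnode-style record: which blocks make up each object
dnodes = {"file_a": [0, 1], "file_b": [2, 3]}

def read_object(name):
    """Reassemble an object from the block list in its dnode."""
    return b"".join(blocks[b] for b in dnodes[name])

object_set = {"file_a", "file_b"}   # one object set groups objects
filesystem = [object_set]           # the file system: object sets

assert read_object("file_a") == b"hello"
```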
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions, the ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured, since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves; the root node, in ZFS terminology, is called the uberblock. Each block&#039;s checksum is maintained in its parent&#039;s indirect block. Keeping the checksum and the data separate reduces the probability of simultaneous corruption. If a write fails for whatever reason, the failure can be detected from the checksum stored above the corrupted block, and a backup copy can be retrieved from another location to correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing: whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it and modified there, and any pointers and indirect blocks are then updated all the way up to the uberblock. [Z1]&lt;br /&gt;
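A minimal sketch of the parent-held checksum and heal-from-replica idea, assuming SHA-256 in place of ZFS&#039;s actual checksum algorithms; Block and read_and_heal are illustrative names.&lt;br /&gt;

```python
import hashlib

# Sketch of ZFS-style healing: the parent holds each child block's
# checksum, so corruption in the child is detected on read and
# repaired from a known-good replica.

def checksum(data):
    return hashlib.sha256(data).digest()

class Block:
    def __init__(self, data):
        self.data = data

def read_and_heal(block, expected_checksum, replica):
    """Verify a block against the checksum stored in its parent;
    on mismatch, restore the block from a replica."""
    if checksum(block.data) != expected_checksum:
        block.data = replica.data     # heal from the duplicate copy
    return block.data

good = b"file contents"
parent_sum = checksum(good)           # kept in the parent indirect block
blk = Block(b"corrupted!!")           # bit rot on disk
replica = Block(good)                 # e.g. a mirror or ditto block
assert read_and_heal(blk, parent_sum, replica) == good
```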
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents problems, such as silent data corruption, that are otherwise hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
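The detached-write-then-atomic-swap sequence can be sketched as follows; cow_update and the dict-based block store are illustrative, not ZFS internals.&lt;br /&gt;

```python
# Sketch of a copy-on-write update: the new block and a new root are
# written out fully detached, and only then does one pointer swap
# make them live. The old structures remain intact throughout.

def cow_update(blocks, root, old_id, new_data):
    """Return a new root id referencing the new data; old blocks
    are never overwritten in place."""
    new_id = max(blocks) + 1
    blocks[new_id] = new_data          # step 1: write new block, detached
    new_root_id = new_id + 1
    # step 2: copy the root, redirecting the pointer to the old block
    blocks[new_root_id] = {name: (new_id if ref == old_id else ref)
                           for name, ref in blocks[root].items()}
    return new_root_id                 # step 3: one atomic root swap

blocks = {1: {"file_a": 2}, 2: b"old contents"}
new_root = cow_update(blocks, root=1, old_id=2, new_data=b"new contents")
assert blocks[2] == b"old contents"    # old version still intact
assert blocks[blocks[new_root]["file_a"]] == b"new contents"
```

If a crash happens before the final swap, the old root still describes a fully consistent tree, which is why no journal replay is needed.&lt;br /&gt;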
&lt;br /&gt;
At the user level, ZFS supports file system snapshots: essentially, a clone of the entire file system at a certain point in time.  In the event of accidental file deletion, a user can retrieve an older version from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file blocks, or patch sets.  There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it.  In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the difference between the logical storage provided and the physical storage needed.  At the same time, however, smaller blocks mean more hash table overhead and more CPU time for deduplication and reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that the file is analyzed as it arrives at the storage server and written to disk in its already-compressed state.  While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must hold the entire deduplication hash table in memory for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes and compresses them later.  This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
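A minimal sketch of in-band, block-level deduplication with a hash table, assuming SHA-256 as the block hash; ingest and the in-memory dicts are illustrative only.&lt;br /&gt;

```python
import hashlib

# Sketch of in-band, block-level deduplication: each incoming block
# is hashed as it arrives; a block whose hash is already in the
# table is stored once physically and merely referenced again.

def ingest(store, refcounts, block):
    """Write a block in-band, deduplicating against the hash table."""
    key = hashlib.sha256(block).hexdigest()
    if key in store:
        refcounts[key] += 1           # duplicate: just add a reference
    else:
        store[key] = block            # unique: store it once
        refcounts[key] = 1
    return key

store, refcounts = {}, {}
for blk in [b"aaaa", b"bbbb", b"aaaa", b"aaaa"]:
    ingest(store, refcounts, blk)
assert len(store) == 2                # only two unique blocks stored
```

The memory cost of keeping the whole hash table resident, mentioned above, corresponds to the `store` and `refcounts` dicts here.&lt;br /&gt;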
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they influenced the development of current file systems, and how they compare to a system such as ZFS.&lt;br /&gt;
&lt;br /&gt;
One such legacy file system is FAT32; another is ext2. These file systems were designed for users who had far fewer and far smaller storage devices than the average user today. The average user at the time did not store many files, and because small amounts of data were accessed infrequently, these file systems did not need elaborate procedures for ensuring data integrity (repairing the file system and relocating files).&lt;br /&gt;
 &lt;br /&gt;
====FAT32====&lt;br /&gt;
Storage devices are divided into sectors (usually 512 bytes). Initially, a file&#039;s data was stored directly in sectors, with larger files spanning multiple sectors; to retrieve a file, the system had to record which sectors held that file&#039;s data. Because each sector is small relative to large files, documenting every sector individually costs significant time and memory. To reduce this bookkeeping, the FAT file system groups sectors into clusters, with each cluster belonging to at most one file. One drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file. The name FAT stands for File Allocation Table, the table that holds an entry for each cluster on the storage device. The FAT is organized as a linked list: the device directory contains a file&#039;s name, its size, and the number of its first cluster; the table entry for that first cluster contains the number of the file&#039;s second cluster, and so on until the last cluster. The first file written to a new device uses sequential clusters, so the first cluster points to the second, which points to the third, and so on.
The digits in a FAT variant&#039;s name, as in FAT32, give the width of the file allocation table&#039;s entries; of FAT32&#039;s 32 bits, 28 are used to number clusters, so 2^28 clusters are addressable. Large clusters waste space when files are drastically smaller than the cluster size. When a file is accessed, the file system must locate all the clusters that make it up, which is slow if the clusters are scattered; as files are deleted, their clusters are freed for reuse, so over time a file&#039;s clusters may end up scattered across the device, making access slower still. FAT32 itself does not defragment, but recent Windows versions ship a defragmentation tool; defragmentation places a file&#039;s clusters near each other, decreasing access time. Finding empty space for a new file also requires a linear search through the clusters. The lack of efficiency, of sufficient storage space, and of data-integrity preservation are thus major drawbacks of FAT32. One redeeming feature is that the first cluster of every FAT32 volume holds information about the operating system and root directory, and two copies of the file allocation table are always kept, so that if the file system is interrupted, the secondary FAT can be used to recover files.&lt;br /&gt;
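The cluster-chain idea can be sketched as follows; the table values and the EOC marker here are simplified illustrations, not the exact FAT32 on-disk format.&lt;br /&gt;

```python
# Sketch of a FAT-style cluster chain: the directory entry names a
# file's first cluster, and each table entry points to the file's
# next cluster until an end-of-chain marker.

EOC = 0xFFFFFFF                       # illustrative end-of-chain marker

# fat[cluster] = next cluster of the same file, or EOC
fat = {2: 3, 3: 4, 4: EOC,            # file A occupies clusters 2, 3, 4
       5: 7, 7: EOC}                  # file B is fragmented: 5 then 7

def clusters_of(fat, first_cluster):
    """Walk the allocation table from a file's first cluster."""
    chain = []
    c = first_cluster
    while c != EOC:
        chain.append(c)
        c = fat[c]
    return chain

assert clusters_of(fat, 2) == [2, 3, 4]
assert clusters_of(fat, 5) == [5, 7]  # fragmentation shows as a gap
```

The scattered chain of file B is exactly the fragmentation a defragmentation tool undoes by relocating clusters to be contiguous.&lt;br /&gt;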
&lt;br /&gt;
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was modeled after UFS (the Unix File System), mimicking certain UFS functionality while removing unnecessary parts. Ext2 organizes the storage space into blocks, which are grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group; it also records the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes, structures containing the file&#039;s description, type, access rights, owners, timestamps, size, and pointers to the data blocks holding the file&#039;s data. Where FAT32 kept a duplicate allocation table in case of a crash, ext2 similarly keeps backup copies of the superblock and the group descriptors (each block group has a group descriptor mapping where files are within the group) throughout the system in case the primary copies are damaged. These backups are used when the system fails or shuts down suddenly, which requires running &amp;quot;fsck&amp;quot; (the file system checker) to traverse the inodes and directories and repair any inconsistencies.&lt;br /&gt;
&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
In general, FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 of 32TB, while ZFS supports 2^58 ZB, an incomparably larger amount. Also, ZFS can find and replace bad data while the system is running, so fsck is not used in ZFS, very much unlike ext2; not having to systematically scan a storage device for inconsistencies saves ZFS time and resources. FAT32 and ext2 each manage a single storage device, whereas ZFS incorporates a volume manager that can control multiple storage devices, virtual or physical. Although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
One system currently in widespread use is NTFS, the New Technology File System. It was first introduced with Windows NT and is used on all modern Microsoft operating systems. NTFS organizes storage into volumes, which are broken down into clusters much like the FAT32 file system. A volume contains an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. &lt;br /&gt;
&lt;br /&gt;
The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds the metadata for all files in the volume; the file system data area stores everything not included in the MFT; and the MFT copy ensures that if the primary MFT is damaged, the file system can still be recovered. The MFT tracks all file attributes in a relational database, of which the MFT itself is a part, and every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: changes are entered into the journal before they are made, in case there is an interruption while they are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance because the entire volume does not have to be scanned to find changes. The main advantages of NTFS are therefore data recovery, data integrity, and protection of data against interruptions during writes.&lt;br /&gt;
&lt;br /&gt;
Another advantage is that NTFS allows compression of files to save disk space. However, this hurts performance: to move compressed files, they must first be decompressed, transferred, and recompressed. Another disadvantage is that NTFS has volume and size constraints. In particular, NTFS is a 64-bit file system, allowing a maximum of 2^64 bytes of storage, but it is capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The fourth extended file system, ext4, is a Linux file system.  Ext4 also uses volumes like NTFS but does not use clusters.  It was designed to allow for greater scalability than ext3.  Ext4 uses extents: an extent is a descriptor representing a run of contiguous physical blocks. [4]  Extents represent the data stored in the volume and give better performance than ext3 when handling large files, where ext3 had very large overhead.  Ext4 is also a journaling file system: it records changes in a journal before making them, in case of an interruption while writing to disk.  To help ensure data integrity, ext4 adds a checksum to the journal, due to the high importance of the data stored there. [4]  Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data.  Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, increasing the maximum volume size to 1EB from the 16TB maximum of ext3. [4]  The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
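The space saving of extents over per-block mapping can be sketched as follows; blocks_from_extents is an illustrative helper, not part of ext4.&lt;br /&gt;

```python
# Sketch of extents vs. per-block mapping: an extent describes a run
# of contiguous physical blocks with one (start, length) pair, so a
# large contiguous file needs far fewer mapping records than one
# pointer per block.

def blocks_from_extents(extents):
    """Expand (first_block, length) extent descriptors to blocks."""
    out = []
    for first, length in extents:
        out.extend(range(first, first + length))
    return out

# a 1000-block contiguous file: one extent instead of 1000 pointers
extents = [(100, 1000)]
blocks = blocks_from_extents(extents)
assert len(blocks) == 1000
assert blocks[0] == 100 and blocks[-1] == 1099
```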
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size.  NTFS allows a maximum volume of 256TB and ext4 allows 1EB, while ZFS allows a maximum file system of 16EB, 16 times more than the current ext4 Linux file system.  Given the amounts of storage available to current file systems, ZFS is clearly better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves availability, as no downtime is needed to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
The newest file systems are quickly closing in on ZFS.  The current forerunner of the pack is Btrfs.  Designed with motivations similar to those of ZFS, Btrfs provides a similarly rich feature set that is well suited to modern computing needs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree file system, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, its goals of efficiency, integrity, and maintainability are clearly visible in its feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs uses space efficiently with tight packing of small files, on-the-fly compression, and very fast search.  For integrity, Btrfs checksums writes and supports snapshots.  Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure.  These trees are composed of three data types: keys, items, and block headers.  Items are sorted on a 136-bit key divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item&#039;s type, while the last 64 bits have distinct uses depending on the type.  The key enables quick searches using hash tables and is necessary for the sorting algorithms that keep the tree balanced.  An item contains a key plus information on the size and location of the item&#039;s data.  Block headers contain various information about a block, such as its checksum, level in the tree, generation number, owner, block number, flags, tree id, and chunk id, among others.&lt;br /&gt;
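The 136-bit key layout can be sketched by packing the three chunks into one integer so that integer order matches key order; make_key is an illustrative helper, not Btrfs code.&lt;br /&gt;

```python
# Sketch of packing a Btrfs-style 136-bit key: a 64-bit object id,
# an 8-bit type, and a 64-bit offset. The object id occupies the top
# 64 bits, so all of one object's items sort before the next object's.
# (Shifts are written as multiplications by powers of two.)

def make_key(objectid, item_type, offset):
    """Pack (objectid, type, offset) so integer order == key order."""
    return (objectid * 2**72) + (item_type * 2**64) + offset

k1 = make_key(257, 1, 0)
k2 = make_key(257, 1, 4096)
k3 = make_key(258, 0, 0)
# same object and type sort by offset, and every item of object 257
# precedes every item of object 258
assert sorted([k3, k2, k1]) == [k1, k2, k3]
```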
&lt;br /&gt;
The trees are constructed from these primitives: interior nodes contain only keys, which identify the node, and block pointers that point to the children of the node; leaves contain multiple items and their data.&lt;br /&gt;
At minimum, a Btrfs file system contains three trees: a tree of tree roots; a subvolume tree, which holds files and directories; and an extent tree, which holds information about all allocated extents. Outside the trees sit the extents themselves and a superblock, a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks, and data relocation.&lt;br /&gt;
&lt;br /&gt;
The copy-on-write method is pivotal to a number of Btrfs features.  Writes never overwrite blocks in place: a transaction log is created, writes are cached, the file system allocates sufficient blocks for the new data, and the new data is written there.  All subvolumes are updated to reference the new blocks, and the old blocks are then removed and freed at the discretion of the file system.  Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data.  After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk.  Together, these actions provide exceptional data integrity.&lt;br /&gt;
&lt;br /&gt;
Btrfs thus has a relatively simple underlying implementation (algorithms notwithstanding) while providing a number of features that help future-proof the file system.&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has.  Btrfs doesn&#039;t have ZFS&#039;s self-healing capability or data deduplication, and ZFS also supports more software RAID configurations than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS does have three more years of development behind it, so Btrfs may very well catch up.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX&lt;br /&gt;
compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps its most important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].&lt;br /&gt;
&lt;br /&gt;
The laboratory&#039;s systems administrators were having problems with their UFS (Unix file system). Given the ever-growing need for additional and larger disks, and the problems of administrators continuously switching disks, repartitioning, and copying old data onto new larger disks, there was a significant amount of user downtime. The administrators were therefore attracted to the storage pool concept and to the added flexibility provided by ZFS.&lt;br /&gt;
&lt;br /&gt;
Lastly, given the increasing demands not only for vast amounts of storage but for the continuous availability of that storage, whether accessed via a smartphone or a server, time wasted on partitioning, disk swapping, and similar activities will soon, if it hasn&#039;t already, become intolerable.&lt;br /&gt;
&lt;br /&gt;
Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
* Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3&lt;br /&gt;
&lt;br /&gt;
* Heybruck, William F. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
* Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*Microsoft- TechNet.(March 28, 2003) &amp;quot;How NTFS Works&amp;quot; [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].&lt;br /&gt;
&lt;br /&gt;
*Microsoft-TechNet. &amp;quot;File Systems&amp;quot; [http://technet.microsoft.com/en-us/library/cc938919.aspx].&lt;br /&gt;
&lt;br /&gt;
*Microsoft-TechNet. ( Sept. 3, 2009) &amp;quot;Best practices for NTFS compression in Windows&amp;quot; [http://support.microsoft.com/kb/251186].&lt;br /&gt;
&lt;br /&gt;
* Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;br /&gt;
&lt;br /&gt;
* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf &amp;quot;The Btrfs Filesystem&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4621</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4621"/>
		<updated>2010-10-15T07:52:36Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is doing what */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Contacts / If interested ==&lt;br /&gt;
Tawfic : tfatah@gmail.com&lt;br /&gt;
&lt;br /&gt;
Andy Zemancik: andy.zemancik@gmail.com&lt;br /&gt;
&lt;br /&gt;
Lester Mundt: lmundt@gmail.com&lt;br /&gt;
&lt;br /&gt;
Matthew Chou : mateh.cc@gmail.com (this is mchou2)&lt;br /&gt;
&lt;br /&gt;
Nisrin Abou-Seido: naseido@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
== Suggested References Format ==&lt;br /&gt;
Author, publisher/university, Name of the article&lt;br /&gt;
&lt;br /&gt;
== Who is doing what ==&lt;br /&gt;
Suggestion: In order to avoid duplication. Please state what section/item you&#039;re currently working on.&lt;br /&gt;
&lt;br /&gt;
Azemanci: Currently working on Section Three Current File Systems.&lt;br /&gt;
&lt;br /&gt;
Nisrin (naseido): Working on intro, conclusion and editing&lt;br /&gt;
&lt;br /&gt;
Lester: Working on section 3 BTRFS and WinFS&lt;br /&gt;
&lt;br /&gt;
Tawfic: added a conclusion. Not planning to add anything else. It&#039;s sleep time !!&lt;br /&gt;
&lt;br /&gt;
KEY ISSUE: We need a thesis statement. Please suggest ideas.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 01:38, 15 October 2010 (UTC) I think the thesis statement is implied in the current intro, i.e. &amp;quot;.. of avoiding some of the major problems associated&lt;br /&gt;
with traditional file systems . . &amp;quot; suggests that it&#039;s unacceptable anymore to tolerate problems in today&#039;s IT environment. If&lt;br /&gt;
you&#039;d like to expand on it, you could mention the continuous need for flexibility vis-a-vis using data. Example: there&#039;s a growing&lt;br /&gt;
trend with cloud computing (the marketable name for distributed computing). Users who will opt for that option will have to trust&lt;br /&gt;
the host companies with their data. It won&#039;t be acceptable to them to be told that a file here or a directory there was lost. The&lt;br /&gt;
issue with the growth of smart phones and yet to proliferate tablets also increases the demands for flexibility ( need to be able&lt;br /&gt;
to add/shrink/manage storage on the fly and avoid manual intervention when problems arise) . . etc. Hope that helps.&lt;br /&gt;
&lt;br /&gt;
The conclusion would have to assert the main points regarding ZFS, i.e. its modularization and administrative simplicity, its&lt;br /&gt;
virtualization of storage via the use of the SPA and DMU, and its self-healing abilities. I am currently working on the section&lt;br /&gt;
that talks (briefly) about ZFS&#039;s self-healing (for lack of better words). So I&#039;ll be online for some time. The info on the SPA&lt;br /&gt;
and DMU is already there. If it&#039;s not clear enough, please let me know.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 03:05, 15 October 2010 (UTC) I agree completely with the thesis and mentioned this at the bottom of the page.  &lt;br /&gt;
&lt;br /&gt;
Something along the lines of &amp;quot;ZFS is a file system designed to support the changing requirements of computing&amp;quot; as I had mentioned before &amp;quot;server needs was of particular attention&amp;quot; then continue to expand by describing the environment that is causing these changes &amp;quot;more companies with distributed large scale data storage &amp;quot;the cloud&amp;quot; &amp;quot; as Tawfic has suggested this establishes motivation.&lt;br /&gt;
&lt;br /&gt;
Then we mention the goals that were desired extensibility( accomplished through modularization ), reliability ( checksums, copy on write ) and maintainability ( administrative simplicity)&lt;br /&gt;
&lt;br /&gt;
In the intro to ZFS we talk about how it&#039;s feature set supports the design goals.  Tawfic has then done most of the feature descriptions.&lt;br /&gt;
&lt;br /&gt;
The section on legacy filesystems should have a mini-intro talking about the state of the environment they were designed for and their goals at the time.&lt;br /&gt;
&lt;br /&gt;
Describe then which has been done.&lt;br /&gt;
&lt;br /&gt;
Small contrast and compare with ZFS generally summarizing along the lines of &amp;quot;gee that sure is better&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Repeat for the other two sections (current and future), with each contrast/compare getting larger since they are more comparable, and with different conclusions.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 07:34, 15 October 2010 (UTC) Personally I am not certain if an example should be in the conclusion....  I like the opening lines though.  Kind of disregarding the rest of the essay though I think.&lt;br /&gt;
&lt;br /&gt;
== Deadline ==&lt;br /&gt;
Suggestion: Adding content should stop on Thursday, October 14&#039;th at 3:00 PM. Any work after that&lt;br /&gt;
should go into formatting, spelling, and grammar checking.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)&lt;br /&gt;
- I will definitely be adding content after this time probably late, late into the evening.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we&#039;d have to be done by 11:00Pm.&lt;br /&gt;
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 Am or so. So anytime&lt;br /&gt;
before that is Ok.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 07:22, 15 October 2010 (UTC)&lt;br /&gt;
I wish I could have got this started earlier but 4104 had a crazy assignment that destroyed me since Saturday.&lt;br /&gt;
&lt;br /&gt;
== Essay Format Take 2 ==&lt;br /&gt;
Hello. I am suggesting the following format instead. If you agree, I&#039;ll take care of merging the existing info into this new format. My feeling is that this format is&lt;br /&gt;
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Abstract&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly&lt;br /&gt;
 growing needs of dynamic storage.&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer&lt;br /&gt;
 demand . . etc )&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more&lt;br /&gt;
 intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Limitations&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Zettabyte File System&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Dissected&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: List the seven components of ZFS and basically what makes a ZFS&lt;br /&gt;
 E.g. interface, various parts, and external needed libraries . . etc.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Features Beyond Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Possible Real-Life Scenarios / Examples&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: 2-3 examples where ZFS was/could/is being considered for use.&lt;br /&gt;
&lt;br /&gt;
 TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract&lt;br /&gt;
         thesis statement).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Alternatives to ZFS&#039;&#039;&#039;&lt;br /&gt;
 one example is good enough.&lt;br /&gt;
 TO-DO: a brief description of the alternative.&lt;br /&gt;
 Main argument for it’s viability.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Pros/Cons&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: just a list of pluses and minuses&lt;br /&gt;
&lt;br /&gt;
 TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional&lt;br /&gt;
 FS’s are no longer viable, and  stressing once more that ZFS is a valid alternative.&lt;br /&gt;
&lt;br /&gt;
== Essay Format ==&lt;br /&gt;
&lt;br /&gt;
I started working on the main page.  The bullets are to be expanded. Other groups are working in their respective discussion pages but I think it&#039;s all right to put our work in progress on the front page.  Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)&lt;br /&gt;
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt;  what do you think of listing the capacities of the file system under major features?  I was thinking that we could overview the features in brief, then delve into each one individually.&lt;br /&gt;
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what you&#039;re suggesting in one section. So here is the structure I am thinking of.&lt;br /&gt;
&lt;br /&gt;
* Intro &lt;br /&gt;
* Section One ZFS&lt;br /&gt;
** Major feature 1&lt;br /&gt;
** Major feature 2&lt;br /&gt;
** Major feature 3 &lt;br /&gt;
* Section Two Legacy File Systems&lt;br /&gt;
** Legacy File System1( FAT32 ) - what it does&lt;br /&gt;
** Legacy File System2( ext2 ) - what it does&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Three Current File Systems&lt;br /&gt;
** NTFS?&lt;br /&gt;
** ext4?&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Four future file Systems&lt;br /&gt;
** BTRFS&lt;br /&gt;
** WinFS or ??&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Conclusion&lt;br /&gt;
&lt;br /&gt;
What does everyone think of this format?   While everyone should contribute to section one we could divvy up the rest.&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming its own essay just on deduplication.&lt;br /&gt;
&lt;br /&gt;
The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.&lt;br /&gt;
&lt;br /&gt;
The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than a full end-to-end solution.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC)  The section on deduplication looks good; I agree it looks like the right amount of coverage for a portion of a single section.  You&#039;re also right about the old file systems not being able to hold a candle to ZFS, and the conclusion section should talk about how storage needs and computers changed.  An intro to that section could set the stage for that period as well: non-multi-threaded, single-processor systems with much smaller RAM; even the applications were radically different, and the Internet was just single webpages without the high-performance needs of web commerce and online banking, for example.  I have another assignment so won&#039;t be contributing too much until Monday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)&lt;br /&gt;
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)&lt;br /&gt;
I think I prefer the outline I proposed only because it&#039;s a very regimented contrast/compare essay format and should get us any marks for format.  Most proper essays don&#039;t usually have a dedicated pros cons list.  Heading more towards a report format I think.  It&#039;s really what everyone agrees on.  I won&#039;t be touching the essay until tomorrow though.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)&lt;br /&gt;
I like Lmundt&#039;s outline.  How would you like to divide up the work?  Also can everyone post the contact information so we know exactly who is in our group.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)&lt;br /&gt;
No problem, I&#039;ll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include&lt;br /&gt;
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that&lt;br /&gt;
if we keep these points in mind. I started looking into the &amp;quot;dissect subsection&amp;quot; in the format I suggested, which is related to the ZFS features&lt;br /&gt;
section one in the current format. I&#039;ll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I&#039;ll&lt;br /&gt;
take care of section one since I&#039;ve already done some work on it. I suggest that each member of the group picks two items from one of the other&lt;br /&gt;
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract&lt;br /&gt;
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the&lt;br /&gt;
different pieces.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)&lt;br /&gt;
I will begin working on section three current File Systems unless someone else has already begun working on it.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)&lt;br /&gt;
I am going to start researching for section 2.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC)  Alright so all the sections are being taken care of so we should be good to go for Thursday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) &#039;&#039;&#039;No one is assigned to section four&#039;&#039;&#039; ? Also, for those who haven&#039;t picked any section or subsection, please help out with the sections you&#039;re&lt;br /&gt;
more familiar with.&lt;br /&gt;
&lt;br /&gt;
Finally, if you were in class today (well, technically yesterday), then you&#039;ve heard Anil talk about plagiarism. I know this is common knowledge, so forgive&lt;br /&gt;
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is&lt;br /&gt;
simply impossible for the rest of the group to check whether every member&#039;s sentence is genuine or not. So use your own words/phrases ( doesn&#039;t&lt;br /&gt;
have to be fancy or sophisticated ). If you&#039;re not sure, please check with the rest of the group.&lt;br /&gt;
&lt;br /&gt;
Good luck, and good night.&lt;br /&gt;
--Tawfic&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC)  My bad I misread something I thought you were doing current file systems section 3.  I&#039;ll take section 3 but then someone needs to do section 4.  There are 4 of us so this should not be a problem.&lt;br /&gt;
&lt;br /&gt;
--[[User:Naseido|Naseido]] 13 October 2010  Sorry I haven&#039;t contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I&#039;ll be doing section 4.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)&lt;br /&gt;
I have done what I can for the legacy file systems, if someone who doesn&#039;t have any particular job wouldn&#039;t mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC)  I&#039;m not going to have my info posted by 3:00.  Also how and where are we supposed to cite our sources?&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have till 7:00 Am ( or till Anil locks the Wiki down, though I wouldn&#039;t count on more than 7) Friday, Oct 15. For citing, I am using&lt;br /&gt;
this convention. Bla.....Bla [Z1. P3] means I am using info from page 3 of article labeled as Z1 in references section.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 20:48, 14 October 2010 (UTC) Citation reference looks good.  I think in each section we should talk about the motivation behind the design, with a small intro for each section.  For example, in the legacy area (FAT32 and ext), we could talk about computing becoming commonplace for the average worker, so most development was for single users with a relatively small hard drive and not too many files.  That will in turn lead to justifying/explaining the design decisions for each vintage of OS. &lt;br /&gt;
The main intro should talk about the motivations behind the design: intended for servers, with a focus on expandability, reliability and self-maintenance.  This is the motivation behind all those cool features we are detailing.&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
Not from your group. Found a file which goes to the heart of your problem&lt;br /&gt;
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-149902.pdf ZFSDatasheet]&lt;br /&gt;
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:  &lt;br /&gt;
* PDF:  http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf&lt;br /&gt;
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&amp;amp;hl=en&amp;amp;as_sdt=2000&lt;br /&gt;
&lt;br /&gt;
Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey, I&#039;m not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey there. I&#039;m not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations&lt;br /&gt;
&lt;br /&gt;
-- Munther&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS&lt;br /&gt;
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)&lt;br /&gt;
A bit late, but I found a comparison of current file systems including ZFS:&lt;br /&gt;
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf&lt;br /&gt;
&lt;br /&gt;
http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4617</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4617"/>
		<updated>2010-10-15T07:46:13Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to handle the functionality required of both a file system and a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2] &lt;br /&gt;
The growing needs of file storage are currently best met by ZFS, as the requirements that it satisfies are not fully implemented by any other single system.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks&#039; DVA&#039;s (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
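The malloc()/free() analogy above can be made concrete with a toy pool. This is a sketch under stated assumptions, not the real SPA interface: the class and method names are invented, and a real pool spans many devices with 128-bit addresses rather than a small set of block numbers:

```python
class StoragePool:
    """Toy SPA-like interface: allocate and free blocks by virtual address (DVA)."""
    def __init__(self, nblocks):
        self.free_blocks = set(range(nblocks))  # every block starts out free
        self.allocated = set()

    def alloc(self):
        # Analogous to malloc(): hand out some free block's virtual address.
        # The caller never sees which physical device the block lives on.
        dva = self.free_blocks.pop()
        self.allocated.add(dva)
        return dva

    def free(self, dva):
        # Analogous to free(): the block returns to the shared pool,
        # where any file system drawing on the pool may reuse it.
        self.allocated.remove(dva)
        self.free_blocks.add(dva)

pool = StoragePool(8)
dva = pool.alloc()
pool.free(dva)
assert len(pool.free_blocks) == 8   # freed storage is shared again
```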
Virtual devices abstract device drivers. A virtual device can be thought of as a node with children, each of which is either another virtual device or a device driver. The SPA also handles traditional volume manager tasks such as mirroring, and it accomplishes such tasks through virtual devices: each one implements a specific task, so if the SPA needs to handle mirroring, a virtual device is written for mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.  &lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information, which allows for an enormous per-object storage limit.&lt;br /&gt;
 &lt;br /&gt;
ZFS also uses the idea of a dnode. A dnode is a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is then used to describe the file system. In essence, a ZFS file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS&#039;s flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX-compliant interface to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL groups related write changes to objects and uses the DMU object transaction interface to apply them as a unit [Z1.P9]. Consistency is therefore assured, since updates are performed atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039;s terminology, is called the uberblock. Each block has a checksum, which is maintained in its parent&#039;s indirect block. Keeping the checksum and the data separate reduces the probability of simultaneous corruption. If a write fails for whatever reason, the failure can be detected from the root, since the checksum of the corrupted block is reachable from the uberblock. A backup copy can then be retrieved from another location to correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied there, and the change is applied to the copy. Any pointers and/or indirect blocks are then updated, propagating all the way up to the uberblock. [Z1]&lt;br /&gt;
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;Ditto Blocks&amp;quot; provide an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
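As a sketch of these two mechanisms together (hypothetical structures, not the actual ZFS on-disk format), a block pointer can carry an expected checksum plus several duplicate addresses, and a read falls back to the next copy whenever the checksum does not match.&lt;br /&gt;

```python
import zlib

# Sketch only, not the real on-disk format: a block pointer holds the
# expected checksum plus several "ditto" addresses, each of which should
# contain an identical copy of the data. CRC32 stands in for the real
# checksum algorithm.
def checksum(data):
    return zlib.crc32(data)

def read_with_healing(storage, block_ptr):
    """Return the first copy whose checksum matches, or None."""
    for addr in block_ptr["ditto_addrs"]:
        data = storage[addr]
        if checksum(data) == block_ptr["checksum"]:
            return data      # healthy copy found
    return None              # all copies corrupted
```

A real implementation would also rewrite the corrupted copy from the healthy one, completing the heal.&lt;br /&gt;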
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
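A minimal sketch of this update model (illustrative only): new blocks are always written to fresh addresses, and the change becomes visible only when the single root reference is swapped, which stands in for the one atomic write.&lt;br /&gt;

```python
# Sketch of the copy-on-write update model: new versions of blocks are
# written to fresh locations while the old tree stays valid; the update
# becomes visible only when the root reference is switched, standing in
# for the one atomic write.
blocks = {}          # address -> data
next_addr = 0

def write_block(data):
    global next_addr
    addr = next_addr          # never overwrite an existing block
    next_addr += 1
    blocks[addr] = data
    return addr

# Initial tree: a root structure pointing at one file block.
root = write_block({"file": write_block(b"version 1")})

def update_file(root_addr, new_data):
    # Build the replacement structures fully detached from the live tree...
    new_file_addr = write_block(new_data)
    new_root_addr = write_block({"file": new_file_addr})
    # ...then "commit" by returning the new root: one pointer swap.
    return new_root_addr

new_root = update_file(root, b"version 2")
```

Note that a crash before the swap leaves the old, fully consistent tree in place, which is why no journal replay is needed.&lt;br /&gt;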
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your &amp;quot;fold factor&amp;quot;, that is, the ratio between the logical storage provided and the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.&lt;br /&gt;
&lt;br /&gt;
One such legacy file system is FAT32; another is ext2. These file systems were designed for users who had far fewer and far smaller storage devices and storage needs than the average user today. The average user at the time did not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).&lt;br /&gt;
 &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored onto a storage device, the device&#039;s space is divided into sectors (usually 512 bytes). Initially, the plan was for these sectors to hold file data directly, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors contain that file&#039;s data. Since each sector is tiny compared to larger files, documenting every sector and its associated file would take a significant amount of time and memory. To avoid this overhead, the FAT file system introduced clusters: defined groupings of sectors, each of which is allocated to exactly one file. One issue with clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster are wasted, since no other file can use them. The name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster&#039;s information. The device directory contains the name and size of each file along with the number of the first cluster allocated to it; the table entry for that first cluster contains the number of the file&#039;s second cluster, and so on until the last cluster entry for that file. The first file on a new device will use sequential clusters, so the first cluster points to the second, which points to the third, and so on. 
The digits in a FAT system&#039;s name, as in FAT32, indicate that the file allocation table is an array of 32-bit values. Of those 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. One issue with large clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that together make up the file; this takes longer if the clusters are not contiguous. When files are deleted, their clusters are freed for new data, so over time a file&#039;s clusters may end up scattered across the storage device, slowing access. FAT32 does not include a built-in defragmentation system, though all recent Windows versions ship a defragmentation tool: de-fragmentation reorganizes the fragments (clusters) of each file so that they are near each other, decreasing access time. Similarly, looking for empty space to store a file requires a linear search through the clusters. The lack of efficiency, limited storage space, and weak data-integrity preservation are thus the major drawbacks of FAT32. One redeeming feature is that the first region of every FAT32 volume contains information about the operating system and root directory, and always holds two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
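The chain-following behaviour described above can be sketched as follows (a toy model; real FAT entries use reserved sentinel values rather than -1, and the free-space search would scan table entries rather than a Python set).&lt;br /&gt;

```python
# Sketch of the File Allocation Table as a linked list: the directory
# entry gives the first cluster, and each FAT entry gives the next
# cluster in the file, with a sentinel marking the end of the chain.
END_OF_CHAIN = -1

def cluster_chain(fat, first_cluster):
    """Return the ordered list of clusters making up one file."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

def find_free_cluster(fat, used):
    # Without an allocation bitmap, finding free space is a linear scan.
    for i in range(len(fat)):
        if i not in used:
            return i
    return None
```

The linear scan in `find_free_cluster` illustrates why allocation slows down as a FAT volume fills.&lt;br /&gt;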
&lt;br /&gt;
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains a description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32, the file allocation table defined how file fragments were organized, and it was important to keep duplicate copies of it in case of a crash. For similar reasons, the first block in ext2 is the superblock, which also contains the list of group descriptors (each block group has a group descriptor mapping out where files are located within the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.&lt;br /&gt;
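As an illustrative sketch (field names are simplified, not the actual on-disk layout), an inode points directly at its data blocks rather than chaining through an allocation table.&lt;br /&gt;

```python
# Sketch of an ext2-style inode: file metadata plus direct pointers to
# the data blocks, in contrast to the FAT chain-in-a-table approach.
# Field names are illustrative, not the actual on-disk layout.
class Inode:
    def __init__(self, file_type, size, block_ptrs):
        self.file_type = file_type
        self.size = size              # file size in bytes
        self.block_ptrs = block_ptrs  # direct pointers to data blocks

def read_inode(inode, blocks, block_size=4):
    """Assemble file contents from the blocks the inode points to."""
    data = b"".join(blocks[p] for p in inode.block_ptrs)
    return data[:inode.size]          # last block may be partially used
```

Real ext2 inodes add indirect, double-indirect and triple-indirect pointers for large files, but the direct-pointer case above captures the core idea.&lt;br /&gt;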
&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
In general, when observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 allows 32TB, while ZFS supports 2^58 ZB -- an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies saves ZFS time and resources by not systematically scanning the storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS incorporates volume management that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
One system that is currently in widespread use is NTFS, the New Technology File System. It was first introduced with Windows NT and is used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table (MFT), File System Data and a Master File Table Copy. &lt;br /&gt;
&lt;br /&gt;
The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds all the metadata regarding the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT that ensures the file system can still be recovered if there is an error with the primary MFT. The MFT keeps track of all file attributes in a relational database, of which the MFT is itself a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS are therefore data recovery, data integrity, and protection of data in case of interruption during writing.&lt;br /&gt;
&lt;br /&gt;
Another advantage is that it allows compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred, and then recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, allowing for a theoretical maximum of 2^64 bytes of storage, but it is capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents: an extent is a descriptor representing a run of contiguous physical blocks. [4] Extents represent the data that is stored in the volume and allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal before making them, in case of an interruption while writing to disk. To help ensure data integrity, ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block addresses rather than the 32-bit addresses used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
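The benefit of extents over per-block maps can be sketched as follows (illustrative only): when a file occupies contiguous blocks, a long list of block pointers collapses into a few (start, length) descriptors.&lt;br /&gt;

```python
# Sketch contrasting ext3-style block maps with ext4-style extents: a
# sorted list of physical block numbers for one file collapses into a
# few (start, length) extent descriptors when the blocks are contiguous.
def blocks_to_extents(block_list):
    """Collapse a run of physical block numbers into extents."""
    extents = []
    for b in block_list:
        if extents and b == extents[-1][0] + extents[-1][1]:
            # Block continues the current contiguous run.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((b, 1))   # start a new extent
    return extents
```

For a large, mostly contiguous file, the metadata shrinks from one pointer per block to a handful of descriptors, which is where the performance win comes from.&lt;br /&gt;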
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB -- 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers. This improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, the same main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.&lt;br /&gt;
&lt;br /&gt;
Btrfs was designed with efficiency, integrity and maintainability in mind. Btrfs uses space efficiently with tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and supports snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the b-tree structure. These trees are composed of three data types: keys, items and block headers. All items are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item&#039;s type, and finally the last 64 bits have distinct uses depending on the type. The unique key enables quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. An item contains a key and information on the size and location of the item&#039;s data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, among others.&lt;br /&gt;
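The 136-bit key layout described above can be sketched as follows (illustrative only): packing the three fields into one integer makes keys sort by object id first, then type, then the final field.&lt;br /&gt;

```python
# Sketch of the 136-bit Btrfs key layout: a 64-bit object id, 8 bits for
# the item type, and a final 64-bit field whose meaning depends on the
# type. Arithmetic packing is used here purely for illustration.
def pack_key(objectid, item_type, offset):
    return (objectid * 2**72) + (item_type * 2**64) + offset

def unpack_key(key):
    objectid, rest = divmod(key, 2**72)
    item_type, offset = divmod(rest, 2**64)
    return objectid, item_type, offset
```

Because the object id occupies the most significant bits, comparing two packed keys as plain integers gives exactly the sort order the tree needs.&lt;br /&gt;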
&lt;br /&gt;
The trees are constructed from these primitives, with interior nodes containing only keys (to identify the node) and block pointers that point to the node&#039;s children. Leaves contain multiple items and their data.&lt;br /&gt;
At a minimum, a Btrfs filesystem will contain three trees: a tree of roots, which contains the other trees&#039; roots; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all the allocated extents. Outside of the trees are the extents themselves and a superblock, a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.&lt;br /&gt;
&lt;br /&gt;
The copy-on-write method is a pivotal piece of a number of Btrfs features. Writes never occur on the same blocks: a transaction log is created and writes are cached; the file system allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks, and the old blocks are then removed and freed at the discretion of the filesystem. This copy-on-write, combined with the internal generation number, allows snapshots of the data to be made. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.&lt;br /&gt;
&lt;br /&gt;
Btrfs has a very simple underlying implementation (algorithms notwithstanding) and provides a number of features that help future-proof this file system.&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS does have. Btrfs doesn&#039;t have the self-healing capability or data deduplication of ZFS, and ZFS supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] That said, ZFS has three more years of development behind it, so Btrfs may very well catch up.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease is perhaps one of its most important features; in fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].&lt;br /&gt;
&lt;br /&gt;
The laboratory&#039;s systems administrators were having problems with their UFS (Unix File System) setup. Given the ever-growing need for additional and larger disks, the administrators kept having to switch disks, repartition, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users. &lt;br /&gt;
&lt;br /&gt;
The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.&lt;br /&gt;
&lt;br /&gt;
Users can choose a less demanding checksum algorithm if they are concerned about the slight additional performance cost. [Z1. P11]&lt;br /&gt;
&lt;br /&gt;
Lastly, given the increasing demands not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smart-phone or a server, time wasted on partitioning, disk swapping, and similar activities will soon become intolerable, if it hasn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
ZFS is a worthy choice and should be seriously considered.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
* Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
* Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*Microsoft- TechNet.(March 28, 2003) &amp;quot;How NTFS Works&amp;quot; [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].&lt;br /&gt;
&lt;br /&gt;
*Microsoft-TechNet. &amp;quot;File Systems&amp;quot; [http://technet.microsoft.com/en-us/library/cc938919.aspx].&lt;br /&gt;
&lt;br /&gt;
*Microsoft-TechNet. ( Sept. 3, 2009) &amp;quot;Best practices for NTFS compression in Windows&amp;quot; [http://support.microsoft.com/kb/251186].&lt;br /&gt;
&lt;br /&gt;
* Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;br /&gt;
&lt;br /&gt;
* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf &amp;quot;The Btrfs Filesystem&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4599</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4599"/>
		<updated>2010-10-15T07:24:54Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems in order to handle the functionality required of a file system, as well as that of a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2] &lt;br /&gt;
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
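The malloc()/free()-style behaviour of the SPA can be sketched with a toy model. This is a hypothetical Python illustration of pooled allocation behind opaque DVAs; the class and method names are invented for the example and are not the real SPA interface.&lt;br /&gt;

```python
# Hypothetical model of SPA-style pooled allocation: callers receive an
# opaque DVA and never learn which device backs it.  All names invented.
class StoragePool:
    def __init__(self):
        self.devices = {}    # device id to capacity in blocks
        self.allocated = {}  # DVA to number of blocks
        self.next_dva = 0

    def add_device(self, dev_id, blocks):
        # Storage can be added to the pool dynamically.
        self.devices[dev_id] = blocks

    def alloc(self, nblocks):
        # Behaves like malloc(), but hands back a data virtual address.
        dva = self.next_dva
        self.next_dva += nblocks
        self.allocated[dva] = nblocks
        return dva

    def free(self, dva):
        # Behaves like free(): releases the blocks behind the DVA.
        del self.allocated[dva]

pool = StoragePool()
pool.add_device("disk0", 1024)
dva = pool.alloc(8)
pool.free(dva)
```

Because callers only ever hold DVAs, devices can be added to or removed from the pool without changing any caller&#039;s view of storage.&lt;br /&gt;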
&lt;br /&gt;
Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.  &lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information, which allows for a very large storage limit.&lt;br /&gt;
 &lt;br /&gt;
ZFS also uses the idea of a dnode. A dnode is a data structure that stores per-object information about blocks. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is in turn used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. These levels of abstraction increase ZFS&#039;s flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL groups write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured, since updates are applied atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039;s terminology, is called the uberblock. Each block has a checksum, which is maintained in its parent&#039;s indirect block. Keeping the checksum and the data separate reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure, since it has access to the checksum of the corrupted block. ZFS is then able to retrieve a backup from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it, and the modification is made there. Any pointers and/or indirect blocks are then updated, traversing all the way up to the uberblock. [Z1]&lt;br /&gt;
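The parent-held checksum scheme can be illustrated with a small model. This is an illustrative Python sketch, not ZFS code; the dictionaries standing in for blocks, replicas, and parent-held checksums are assumptions of the example, and SHA-256 merely stands in for ZFS&#039;s configurable block checksum.&lt;br /&gt;

```python
import hashlib

# Illustrative model (not ZFS code): the checksum of each block is kept
# in its parent, so corruption inside a block is detected from outside it.
def checksum(data):
    return hashlib.sha256(data).hexdigest()

blocks = {"leaf": b"file data"}              # the on-disk block
replicas = {"leaf": b"file data"}            # backup copy elsewhere
parent = {"leaf": checksum(blocks["leaf"])}  # checksum held by the parent

def read_and_heal(name):
    data = blocks[name]
    if checksum(data) != parent[name]:  # mismatch: corruption detected
        blocks[name] = replicas[name]   # heal from the backup copy
        data = blocks[name]
    return data

blocks["leaf"] = b"corrupted!"  # simulate silent corruption
assert read_and_heal("leaf") == b"file data"
```

After the read, the corrupted copy has been overwritten with the healthy one, which is the sense in which the read path &amp;quot;heals&amp;quot; the block.&lt;br /&gt;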
&lt;br /&gt;
The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;Ditto Blocks&amp;quot; provide an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
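The ditto-block read path described above can be sketched as follows. This is a hypothetical Python model, not the ZFS implementation; the block pointer is modelled as a simple list of copies.&lt;br /&gt;

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

# Illustrative ditto-block read: the block pointer is modelled as a list
# of copies, and the first copy whose checksum matches is returned.
def read_ditto(block_pointer, expected):
    for copy in block_pointer:
        if checksum(copy) == expected:
            return copy
    raise IOError("all ditto copies are corrupt")

good = b"metadata"
pointer = [b"garbage", good]  # first copy corrupted, second one healthy
assert read_ditto(pointer, checksum(good)) == good
```

Only when every copy fails its checksum does the read actually fail, which is why duplicating metadata by default is such a cheap form of insurance.&lt;br /&gt;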
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
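The detached-write-then-atomic-connect idea can be modelled in a few lines. This is an illustrative Python sketch in which a single pointer update stands in for the atomic uberblock write; the class and field names are invented, not ZFS code.&lt;br /&gt;

```python
# Sketch of the detached-write/atomic-connect update: new structures are
# written out fully before a single pointer flip makes them live.  The
# "live" field stands in for the uberblock; names are invented.
class CowStore:
    def __init__(self, root):
        self.versions = {0: root}   # all written-out tree roots
        self.live = 0               # the one pointer readers follow

    def update(self, new_root):
        txg = self.live + 1
        self.versions[txg] = new_root  # detached: old root still intact
        self.live = txg                # one atomic update commits it

store = CowStore({"file": "v1"})
store.update({"file": "v2"})
```

A crash before the final pointer update simply leaves the old, consistent version live, which is why no journal replay or fsck pass is needed afterwards; retaining old roots is also what makes snapshots nearly free.&lt;br /&gt;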
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version out of a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file units (blocks), or patch sets. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time for deduplication and reconstruction.&lt;br /&gt;
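The fold-factor trade-off can be made concrete with a toy calculation. This is an illustrative Python sketch; the data and block sizes are invented for the example.&lt;br /&gt;

```python
# Toy illustration of the block-size trade-off in deduplication: smaller
# blocks expose more duplicates (a higher fold factor) but need more
# hash-table entries to track.  Data and sizes are invented.
def dedup_stats(data, block_size):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = set(blocks)
    fold_factor = len(blocks) / len(unique)  # logical blocks / physical blocks
    return fold_factor, len(unique)          # unique count ~ hash-table size

data = b"abcdabcdabcdabcd"
coarse_fold, coarse_entries = dedup_stats(data, 8)  # two 8-byte blocks, 1 unique
fine_fold, fine_entries = dedup_stats(data, 4)      # four 4-byte blocks, 1 unique
```

On this contrived input the 4-byte granularity doubles the fold factor; on real data, finer granularity also multiplies the number of hash-table entries to store and compare.&lt;br /&gt;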
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk already in its compressed state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them later to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
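An in-band write path along these lines can be sketched as follows. This is a hypothetical Python model of hash-table deduplication, not the ZFS implementation; SHA-256 is used here only as an example of a strong block hash.&lt;br /&gt;

```python
import hashlib

# Sketch of an in-band deduplicating write path: each incoming block is
# hashed before it reaches disk, and a duplicate stores only a reference.
physical = {}   # hash -- block actually written to disk
refcount = {}   # hash -- number of logical references

def write_block(data):
    h = hashlib.sha256(data).hexdigest()
    if h not in physical:          # new unique block: write it once
        physical[h] = data
    refcount[h] = refcount.get(h, 0) + 1
    return h                       # logical pointer recorded by the file

p1 = write_block(b"same payload")
p2 = write_block(b"same payload")
```

The second write costs only a hash lookup and a reference-count bump, which is why in-band dedup trades memory and CPU for disk I/O.&lt;br /&gt;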
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and when saving files to these media there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than those of today. The average user did not have many files stored on their hard drive, and because relatively small amounts of data were accessed infrequently, these file systems did not worry much about procedures for maintaining data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s capacity is divided into sectors (usually 512 bytes). Initially, each sector would hold part of a file&#039;s data, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors contain that file&#039;s data. Since each sector is small relative to larger files, tracking every sector individually would cost significant time and memory. To avoid this overhead, the FAT file system groups sectors into clusters, defined groupings of sectors, where each cluster belongs to at most one file. One drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file. The name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device and its properties. The FAT is designed as a linked-list data structure in which each node holds a cluster&#039;s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F&#039;s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits in a FAT variant&#039;s name, as in FAT32, indicate that the file allocation table is an array of 32-bit values.#2.3b Of those 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. Larger clusters waste space when files are drastically smaller than the cluster size, since the excess space in each cluster goes unused. When a file is accessed, the file system must locate all the clusters that together make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters are freed for new data; as a result, some files end up with their clusters scattered across the device, slowing access. FAT32 does not include a defragmentation system, but all recent Windows releases ship with a defragmentation tool. Defragmenting rearranges the fragments of a file (its clusters) so that they reside near each other, which improves file access time. Because reorganization (defragmenting) is not a built-in function of FAT32, finding empty space for a new file requires a linear search through all the clusters; this is one of the drawbacks of FAT32: it is slow. The first cluster of every FAT32 file system contains information about the operating system and root directory, and always holds two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT is available to recover the files.&lt;br /&gt;
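The linked-list chain described in the quotation can be sketched as a short walk over a toy table. This is an illustrative Python model; the dictionary standing in for the FAT and the 28-bit all-Fs end-of-chain value are assumptions of the example.&lt;br /&gt;

```python
# Walking a FAT cluster chain as a linked list.  The table is modelled as
# a dictionary, and 0x0FFFFFFF (28 one-bits, the all-Fs marker mentioned
# above) marks the final cluster of a file.
EOC = 0x0FFFFFFF

def cluster_chain(fat, first_cluster):
    chain = [first_cluster]
    while fat[chain[-1]] != EOC:
        chain.append(fat[chain[-1]])
    return chain

# A fragmented file occupying clusters 2, 3 and 7:
fat = {2: 3, 3: 7, 7: EOC}
assert cluster_chain(fat, 2) == [2, 3, 7]
```

The directory entry only records the first cluster; every subsequent cluster is found by following the table, which is why a scattered chain makes reads slow.&lt;br /&gt;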
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes, structures that contain the description of the file: file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file&#039;s data. Just as FAT32 keeps duplicate copies of its file allocation table in the first cluster in case of crashes, the first block in ext2 is the superblock, which also contains the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system has had an unclean shutdown and requires the “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 reaches 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not needed in ZFS, whereas it is in the ext2 filesystem. Not having to check for inconsistencies lets ZFS save time and resources by not systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and nothing is unavailable when accessing data through ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
One system currently in widespread use is NTFS, the New Technology File System. It was first introduced with Windows NT and is used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table, file system data, and a Master File Table copy. &lt;br /&gt;
&lt;br /&gt;
The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table (MFT) holds all the metadata for every file in the volume. The file system data area stores all data not included in the MFT. Finally, the MFT copy is a duplicate of the MFT, ensuring that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also a part. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS, therefore, are that it provides data recovery, data integrity, and protection of data in case of interruption during writing.&lt;br /&gt;
&lt;br /&gt;
Another advantage is that NTFS allows compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred, then recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, descriptors representing ranges of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance than ext3 when handling large files; ext3 had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal before making them, in case of an interruption while writing to disk. To help ensure data integrity, ext4 utilizes checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block addresses rather than the 32-bit addresses used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB, sixteen times more than the current ext4 Linux file system. Given the amounts of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having begun development just three years after ZFS, its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs was designed with efficiency, integrity and maintainability in mind. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items, and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item&#039;s type; the last 64 bits have distinct uses depending on the type. The unique key enables quick searches using hash tables and is necessary for the sorting algorithms that keep the tree balanced. Items contain a key and information on the size and location of the item&#039;s data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.&lt;br /&gt;
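The 136-bit key layout described above can be modelled by packing the three chunks into one integer, so that ordinary integer comparison yields the sort order. This is an illustrative Python sketch; the field order follows the description in the text, and the example values are invented.&lt;br /&gt;

```python
# Packing the 136-bit key into one integer: a 64-bit object id, then an
# 8-bit item type, then 64 type-dependent bits, high to low, so that
# comparing packed keys sorts by object id first, then type, then offset.
def pack_key(objectid, item_type, offset):
    return objectid * 2**72 + item_type * 2**64 + offset

def unpack_key(key):
    objectid, rest = divmod(key, 2**72)
    item_type, offset = divmod(rest, 2**64)
    return objectid, item_type, offset

keys = [pack_key(257, 0, 0), pack_key(256, 2, 0), pack_key(256, 1, 1)]
# Sorting the packed integers orders items by object id, then by type:
assert sorted(keys) == [pack_key(256, 1, 1),
                        pack_key(256, 2, 0),
                        pack_key(257, 0, 0)]
assert unpack_key(pack_key(256, 1, 1)) == (256, 1, 1)
```

Because all items belonging to one object share the high bits of the key, they end up adjacent in the sorted tree, which is what makes lookups within an object fast.&lt;br /&gt;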
&lt;br /&gt;
The trees are constructed from these primitives, with interior nodes containing only keys (to identify the node) and block pointers that point to the node&#039;s children. Leaves contain multiple items and their data.&lt;br /&gt;
&lt;br /&gt;
At a minimum, a Btrfs filesystem contains three trees: a tree of roots, which contains the other tree roots; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all allocated extents. Outside of the trees are the extents themselves and a superblock, a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.&lt;br /&gt;
&lt;br /&gt;
The copy-on-write method is a pivotal piece of a number of Btrfs features. Writes never occur in place on the same blocks: a transaction log is created, writes are cached, the file system allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks, and the old blocks are then removed and freed at the discretion of the filesystem. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
Upon first inspection, Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS does have. Btrfs does not have the self-healing capability or data deduplication of ZFS, and ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and ease of administration are perhaps its most important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].&lt;br /&gt;
&lt;br /&gt;
The laboratory&#039;s systems administrators were having problems with their UFS (Unix File System). Given the ever-growing need for additional and larger disks, the administrators repeatedly had to switch disks, repartition, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users. &lt;br /&gt;
&lt;br /&gt;
The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) with a few simple commands. Beyond the aforementioned features, ZFS offers some additional flexibility.&lt;br /&gt;
&lt;br /&gt;
TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming&lt;br /&gt;
&lt;br /&gt;
TO-DO: Plasma Physics Example&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The&lt;br /&gt;
new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;br /&gt;
&lt;br /&gt;
* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf &amp;quot;The Btrfs Filesystem&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4590</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4590"/>
		<updated>2010-10-15T07:17:26Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to provide the functionality of both a file system and a volume manager. Among the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: possible data corruption, especially silent corruption, the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2] &lt;br /&gt;
The growing needs of file storage are currently best met by ZFS, as the requirements it satisfies are not fully implemented by any other single system.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks&#039; DVA&#039;s (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
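The malloc()/free() analogy can be made concrete with a toy sketch in Python (all names here are invented, not the real SPA interface): the caller works only with DVAs and never sees physical locations.&lt;br /&gt;

```python
# Toy model of the SPA's allocate/free interface. Hypothetical names;
# the point is that callers see only opaque DVAs (data virtual
# addresses), never physical device offsets.

class StoragePool:
    def __init__(self):
        self.blocks = {}       # DVA to block contents
        self.next_dva = 0

    def alloc(self, size):
        """Like malloc(), but hands back storage identified by a DVA."""
        dva = self.next_dva
        self.next_dva += 1
        self.blocks[dva] = bytearray(size)
        return dva

    def free(self, dva):
        """Like free(): the block returns to the pool for reuse."""
        del self.blocks[dva]
```

Because callers hold only virtual addresses, the pool is free to remap them underneath, which is what lets devices be added to or removed from a live pool.&lt;br /&gt;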
&lt;br /&gt;
Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.  &lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information, which allows for a very large storage limit.&lt;br /&gt;
 &lt;br /&gt;
ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks belonging to an object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is in turn used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is a collection of blocks. Such levels of abstraction increase ZFS&#039;s flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039;s terminology, is called the uberblock. Each block has a checksum, which is maintained in its parent&#039;s indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the failure can be detected, since the checksum of the corrupted block is available from its parent. ZFS can then retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it, and the modification is applied there. Any pointers and/or indirect blocks are then updated, traversing all the way up to the uberblock. [Z1]&lt;br /&gt;
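This detect-and-heal cycle can be sketched in a few lines of Python (a simplified model: SHA-256 stands in for whichever checksum algorithm is configured, and the dictionary of replicas is an invented stand-in for the real redundancy machinery):&lt;br /&gt;

```python
import hashlib

def checksum(block):
    # SHA-256 here models whichever checksum algorithm ZFS is using.
    return hashlib.sha256(block).digest()

def read_and_heal(blocks, replicas, addr, parent_checksum):
    """The parent-held checksum detects corruption; a replica repairs it."""
    block = blocks[addr]
    if checksum(block) == parent_checksum:
        return block
    # Mismatch: fetch a good copy from another location and heal in place.
    good = replicas[addr]
    assert checksum(good) == parent_checksum, "replica also corrupt"
    blocks[addr] = good
    return good
```

The key point the sketch captures is that the checksum lives apart from the data it covers, so a corrupt block cannot vouch for itself.&lt;br /&gt;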
&lt;br /&gt;
The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;Ditto Blocks&amp;quot; provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer in the hope of finding a healthy block.&lt;br /&gt;
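The ditto-block read path can be sketched as follows (a simplified model of one block pointer holding several duplicate copies; SHA-256 stands in for the configured checksum):&lt;br /&gt;

```python
import hashlib

def checksum(block):
    return hashlib.sha256(block).digest()

def read_ditto(copies, stored_checksum):
    """Try each duplicate copy behind one block pointer until some
    copy's checksum matches; fail only if every copy is bad."""
    for block in copies:
        if checksum(block) == stored_checksum:
            return block
    raise IOError("all ditto copies failed their checksum")
```

A read succeeds as long as at least one copy survives, which is why ditto blocks are applied to metadata by default: losing metadata loses everything beneath it.&lt;br /&gt;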
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
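The same transactional idea in miniature (hypothetical names; a single pointer assignment plays the role of the one atomic write):&lt;br /&gt;

```python
class COWStore:
    """Copy-on-write update model: new state is written out detached,
    then one root pointer is swapped atomically. No journal is needed;
    a crash before the swap simply leaves the old root live."""

    def __init__(self, data):
        self.versions = [data]   # append-only storage; nothing overwritten
        self.root = 0            # the "uberblock": index of the live version

    def update(self, new_data):
        self.versions.append(new_data)        # detached write, checked first
        self.root = len(self.versions) - 1    # single atomic swap

    def read(self):
        return self.versions[self.root]
```

Since old versions are never overwritten, keeping a previous root around is essentially free, which is also what makes the snapshots described below cheap.&lt;br /&gt;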
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version out of a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set.   There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it.   In general, as you consider smaller blocks of data for deduplication, you increase your &amp;quot;fold factor&amp;quot;, that is, the ratio between the logical storage provided and the physical storage needed.  At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already compressed state.  While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested.   In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way), and a background process analyzes them at a later time to perform the compression.  This method needs higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
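A toy in-band deduplicating store (invented names, SHA-256 as the block fingerprint) makes the hash-table mechanism concrete: each unique block is stored once, and files keep only lists of fingerprints:&lt;br /&gt;

```python
import hashlib

class DedupStore:
    """In-band dedup sketch: blocks are fingerprinted on arrival and
    written only if never seen before."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.table = {}   # fingerprint to unique block (the dedup table)
        self.files = {}   # filename to list of fingerprints

    def write(self, name, data):
        prints = []
        for i in range(0, len(data), self.block_size):
            blk = data[i:i + self.block_size]
            h = hashlib.sha256(blk).hexdigest()
            self.table.setdefault(h, blk)   # store only if unseen
            prints.append(h)
        self.files[name] = prints

    def read(self, name):
        return b"".join(self.table[h] for h in self.files[name])
```

In this toy example, two files that are block-level permutations of each other share all their physical blocks, illustrating the fold factor between logical and physical storage.&lt;br /&gt;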
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and saving files to these media requires an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 and ext2 are two examples. These file systems were designed for users who had fewer and smaller storage devices than those of today. Because the average user stored small amounts of data that might not be accessed often, these file systems did not worry much about procedures for repairing data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is divided into sectors (usually 512 bytes). Initially the plan was for each sector to contain a file&#039;s data, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors hold each file&#039;s data. Since each sector is small relative to many files, documenting every sector, the file it belongs to, and its location would cost significant time and memory. To avoid tracking so many sectors individually, the FAT file system uses clusters: defined groupings of sectors, each of which is allocated to a single file. One discovered drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file. In the FAT32 file system, the name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds one cluster&#039;s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. 
Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits in each FAT variant&#039;s name, as in FAT32, indicate that the file allocation table is an array of 32-bit values.#2.3b Of those 32 bits, 28 are used to number the clusters on the device, so 2^28 clusters are addressable. Larger clusters cause problems when files are drastically smaller than the cluster size, because much of the cluster is wasted. When a file is accessed, the file system must locate all of the clusters that make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters are freed for new data, so some files end up with their clusters scattered across the device and take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmenting reorganizes a file&#039;s fragments (clusters) so that they reside near each other, which reduces the time it takes to access a file. Because reorganization (defragging) is not a default function in FAT32, finding empty space for a new file requires a linear search through all the clusters; this is one of the drawbacks of FAT32: it is slow. The first cluster of every FAT32 file system holds information about the operating system and the root directory, and always contains two copies of the file allocation table so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
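The linked-list walk described in the quotation above can be sketched directly (a toy FAT as a Python dictionary; the constant models the &amp;quot;all F&#039;s&amp;quot; end-of-chain marker in the 28 usable bits of a FAT32 entry):&lt;br /&gt;

```python
END_OF_CHAIN = 0x0FFFFFFF  # models the "all F's" marker of a FAT32 entry

def cluster_chain(fat, first_cluster):
    """Follow one file's chain: each FAT entry names the next cluster,
    starting from the first cluster recorded in the directory entry."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

# A fragmented file: starts at cluster 2, jumps to 5, ends at 3.
fat = {2: 5, 5: 3, 3: END_OF_CHAIN}
```

The out-of-order chain in the example is exactly the fragmentation that defragmentation tools exist to undo.&lt;br /&gt;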
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System), mimicking certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes: structures that contain the description of the file, its type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. Just as FAT32 keeps duplicate copies of the FAT in the first cluster in case of crashes, the first block in ext2 is the superblock, which is followed by the list of group descriptors (each block group has a group descriptor mapping where files are within the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system shuts down uncleanly and requires “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
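The inode&#039;s role as the bridge from file metadata to data blocks can be sketched as follows (simplified: real ext2 inodes add indirect, double-indirect and triple-indirect pointers for large files):&lt;br /&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Simplified ext2-style inode: metadata plus direct block pointers."""
    mode: str                                   # file type and access rights
    size: int                                   # length in bytes
    blocks: list = field(default_factory=list)  # pointers to data blocks

def read_file(inode, disk):
    """Gather the file's data blocks and trim to the recorded size."""
    data = b"".join(disk[p] for p in inode.blocks)
    return data[:inode.size]
```

Note that, unlike a FAT chain, all of a file&#039;s block pointers sit together in its inode, so no table walk is needed to find the blocks.&lt;br /&gt;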
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed under different file systems, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 allows 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not needed in ZFS, whereas it is in the ext2 file system. Not having to check for inconsistencies saves ZFS time and resources, since it need not systematically scan a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS includes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and that nothing is unavailable when accessing data through ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
One system that is currently in widespread use is NTFS, the New Technology File System. It was first introduced with Windows NT and is used on all modern Microsoft operating systems. NTFS implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table (MFT), File System Data, and a Master File Table Copy. &lt;br /&gt;
&lt;br /&gt;
The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS, therefore, are that it provides data recovery, data integrity, and protection of data in case of interruption during writing.&lt;br /&gt;
&lt;br /&gt;
Another advantage is that NTFS allows files to be compressed to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage; it is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The fourth extended file system, also known as ext4, is a Linux file system. Like NTFS, ext4 uses volumes, but it does not use clusters; it was designed to allow for greater scalability than ext3. Ext4 uses extents, descriptors representing runs of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance than ext3 when handling large files, for which ext3 had very large overhead. Ext4 is also a journaling file system: it records changes to be made in a journal before making them, in case of an interruption while writing to disk. To help ensure data integrity, ext4 adds a checksum to the journal, due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which raises the maximum volume size to 1EB from ext3&#039;s 16TB maximum. [4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
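The space saving of extents over per-block pointers is easy to see in a sketch (hypothetical structure; real ext4 extents live in an on-disk tree):&lt;br /&gt;

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """One descriptor covers a whole contiguous run of blocks."""
    logical_start: int    # first file block this extent maps
    physical_start: int   # first disk block of the run
    length: int           # number of contiguous blocks

def logical_to_physical(extents, logical_block):
    """Resolve a file-relative block number to its disk block."""
    for e in extents:
        if logical_block in range(e.logical_start, e.logical_start + e.length):
            return e.physical_start + (logical_block - e.logical_start)
    raise KeyError("block not mapped")

# One extent stands in for 1000 individual block pointers.
mapping = [Extent(logical_start=0, physical_start=8192, length=1000)]
```

A single descriptor mapping a thousand blocks is why extents cut the metadata overhead that made ext3 slow on large files.&lt;br /&gt;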
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size.  NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB, 16 times more than the current ext4 Linux file system.  Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, started by Oracle in 2007, is a file system often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, its main goals of efficiency, integrity and maintainability are clearly visible in the Btrfs feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs was designed with efficiency, integrity and maintainability in mind.  Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and supports snapshots.  Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the b-tree structure.  These trees are composed of three primitive data types: keys, items and block headers.  Items are sorted by a 136-bit key divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item&#039;s type, while the last 64 bits have distinct uses depending on the type.  The unique key enables quick searches using hash tables and is necessary for the sorting algorithms that keep the tree balanced.  An item contains a key plus information on the size and location of the item&#039;s data.   Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, among others.&lt;br /&gt;
The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the node&#039;s children.  Leaves contain multiple items and their data. &lt;br /&gt;
At a minimum, a Btrfs filesystem contains three trees: a tree of tree roots; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all allocated extents.  Outside the trees sit the extents themselves and a superblock, a data structure that points to the root of roots.  Btrfs can add further trees to support other features such as logs, chunks and data relocation.&lt;br /&gt;
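The 136-bit key layout can be modelled with plain integers (a sketch using the field widths described above, with multiplication by powers of two standing in for bit shifts):&lt;br /&gt;

```python
def btrfs_key(objectid, item_type, offset):
    """Pack (64-bit objectid, 8-bit type, 64-bit offset) into a single
    136-bit integer. Integer order on packed keys then equals
    lexicographic order on the (objectid, type, offset) tuple."""
    return objectid * 2**72 + item_type * 2**64 + offset

def unpack_key(key):
    """Split a packed key back into its three fields."""
    objectid, rest = divmod(key, 2**72)
    item_type, offset = divmod(rest, 2**64)
    return objectid, item_type, offset
```

Because the object id occupies the most significant bits, all items belonging to one object sort adjacently in the tree, grouped by type; that is what makes range lookups on a single file cheap.&lt;br /&gt;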
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
Upon first inspection, Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS does have.  Btrfs doesn&#039;t have the self-healing capability or data deduplication of ZFS, and ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease is perhaps one of its more important features; in fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].&lt;br /&gt;
&lt;br /&gt;
TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming&lt;br /&gt;
&lt;br /&gt;
TO-DO: Plasma Physics Example&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. (2007). The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger, Dominique A. (post-2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;br /&gt;
&lt;br /&gt;
* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf &amp;quot;The Btrfs Filesystem&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4558</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4558"/>
		<updated>2010-10-15T05:50:46Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to provide the functionality of both a file system and a volume manager. Among the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid the major problems associated with traditional file systems: data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary differences are modularity, virtualization of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2]:&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attribute Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system, i.e. via the division of responsibilities across various modules (in this case, seven). Each of these modules provides a specific piece of functionality; as a consequence, the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
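&lt;br /&gt;
The malloc()/free() analogy above can be sketched in a few lines of Python (a hypothetical toy, not the real SPA interface): callers receive only opaque DVAs, and devices can be added to the pool at any time.&lt;br /&gt;

```python
# Toy storage-pool allocator illustrating the SPA's malloc()/free()-style
# interface (hypothetical sketch, not the actual ZFS API).
class StoragePool:
    def __init__(self):
        self.devices = {}       # device name -) list of free block offsets
        self.allocated = {}     # DVA -) (device, offset)
        self.next_dva = 0

    def add_device(self, name, capacity_blocks):
        # Storage can be added to the pool dynamically.
        self.devices[name] = list(range(capacity_blocks))

    def alloc(self):
        # Pick any device with free space; the caller only ever sees an
        # opaque data virtual address (DVA), never the physical location.
        for name, free_blocks in self.devices.items():
            if free_blocks:
                offset = free_blocks.pop()
                dva = self.next_dva
                self.next_dva += 1
                self.allocated[dva] = (name, offset)
                return dva
        raise MemoryError("pool exhausted")

    def free(self, dva):
        # Return the block behind this DVA to its device's free list.
        name, offset = self.allocated.pop(dva)
        self.devices[name].append(offset)
```

Because alloc() returns an opaque handle, the caller never learns which device or offset backs it; that is the abstraction the SPA provides.&lt;br /&gt;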
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will be a very long time before that technology encounters limitations; even then, the SPA module can be replaced with the remaining modules of ZFS left intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children, where each child can be another vdev or a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, via the use of vdevs. Each vdev implements a specific task: if the SPA needs to handle mirroring, a vdev is written to handle mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8]&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode: a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2] Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining data integrity. This brings us to the topic of how ZFS achieves self-healing; we will cover the main strategies ZFS follows.&lt;br /&gt;
&lt;br /&gt;
These can be summed up as: checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL combines write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1. P9] Consistency is thus assured, since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8]. Each block has a checksum, which is maintained in its parent&#039;s indirect block. This scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure is detected through the checksum of the corrupted block, which is stored higher up in the tree. A backup can then be retrieved from another location and used to correct (heal) the affected block. [Z1. P7-9]&lt;br /&gt;
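&lt;br /&gt;
The detect-and-heal cycle described above can be sketched as follows (a simplified illustration, using SHA-256 in place of ZFS&#039;s configurable checksum; the names are hypothetical):&lt;br /&gt;

```python
import hashlib

def checksum(data):
    # Stand-in for ZFS's configurable block checksum.
    return hashlib.sha256(data).hexdigest()

# The parent's block pointer stores the child's checksum separately from
# the child's data, so a single corruption rarely hits both at once.
storage = {"blk0": b"hello", "replica0": b"hello"}
parent = {"child": "blk0", "replica": "replica0", "sum": checksum(b"hello")}

def read_and_heal(parent, storage):
    data = storage[parent["child"]]
    if checksum(data) == parent["sum"]:
        return data
    # Mismatch: fetch a known-good replica, verify it, and repair
    # (heal) the corrupted copy in place.
    good = storage[parent["replica"]]
    assert checksum(good) == parent["sum"], "replica also corrupt"
    storage[parent["child"]] = good
    return good
```

The key design point is that a block never validates itself; its parent holds the checksum, so corruption of the data and of its checksum are independent events.&lt;br /&gt;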
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it, and the change is applied to the copy. Any pointers and/or indirect blocks are then updated in the same way, propagating all the way up to the uberblock. [Z1]&lt;br /&gt;
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This counts as self-healing simply because it prevents problems, such as silent data corruption, that are otherwise hard to detect.&lt;br /&gt;
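&lt;br /&gt;
The upward propagation of copy-on-write can be sketched like this (a toy model of the idea, not ZFS&#039;s actual on-disk format): every node on the path from the modified leaf to the root is rewritten at a fresh address, and the old tree stays intact until the new root is published.&lt;br /&gt;

```python
# Copy-on-write sketch: a block is never modified in place.  A changed
# leaf gets a new address, which forces a new parent, and so on up to
# the root (the uberblock), which is swapped in one atomic step.
def cow_update(storage, path, new_data, next_addr):
    # storage: dict address -) node; path: addresses from root to leaf.
    child_addr = None
    for addr in reversed(path):
        node = dict(storage[addr])        # copy, never overwrite
        if child_addr is None:
            node["data"] = new_data       # fresh leaf contents
        else:
            node["child"] = child_addr    # point at the fresh copy
        fresh = next_addr
        next_addr += 1
        storage[fresh] = node
        child_addr = fresh
    return child_addr  # address of the new root ("uberblock")
```

Note that the old root and leaf remain untouched in storage, which is why a crash mid-update leaves a consistent (old) tree.&lt;br /&gt;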
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to, hopefully, find a healthy block.&lt;br /&gt;
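&lt;br /&gt;
A read against a multi-copy block pointer might look like the following sketch (hypothetical names; a real ZFS block pointer can hold up to three DVAs):&lt;br /&gt;

```python
import hashlib

def checksum(data):
    # Stand-in for the block checksum stored in the block pointer.
    return hashlib.sha256(data).digest()

def read_block(block_pointer, disk):
    # A ZFS-style block pointer can hold several DVAs ("ditto blocks"),
    # each naming an identical copy of the data.  Try each copy until
    # one matches the stored checksum.
    for dva in block_pointer["dvas"]:
        data = disk[dva]
        if checksum(data) == block_pointer["sum"]:
            return data
    raise IOError("all copies failed checksum")
```
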
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots: essentially, a clone of the entire file system at a certain point in time.  In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression based on the idea of storing any one block of unique data only once physically and logically linking that block to each file that contains it.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if the data lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables and can be applied to whole files, sub-file units (blocks), or patch sets.  There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it.  In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio between the logical storage provided and the physical storage needed.  At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
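&lt;br /&gt;
The hash-table approach and the fold factor can be illustrated with a short sketch (hypothetical, block-granularity, in-memory only):&lt;br /&gt;

```python
import hashlib

def dedup_store(blocks):
    # Block-level deduplication sketch: store each unique block once and
    # keep a hash table mapping digest -) physical slot.
    table = {}       # digest -) slot index
    physical = []    # unique blocks actually stored
    refs = []        # logical file layout: one slot index per block
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in table:
            table[digest] = len(physical)
            physical.append(block)
        refs.append(table[digest])
    # Fold factor: logical blocks provided vs. physical blocks stored.
    fold = len(blocks) / len(physical)
    return physical, refs, fold
```

Reconstruction is just a lookup of each slot index in refs, which is why smaller blocks cost more table memory and CPU on both paths.&lt;br /&gt;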
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already-compressed state.  While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them at a later time to perform the compression.  This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and when saving files onto such media there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today&#039;s. The average user did not have many files stored on their hard drive, and because of the small amounts of data, which might not be accessed all that often, these file systems did not worry much about procedures for preserving data integrity (repairing the file system and relocating files).&lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is made up of sectors (usually 512 bytes). Initially it was planned that these sectors would contain the data of a file, with larger files stored across multiple sectors. In order to retrieve a file, the system must record which sectors contain the data of the requested file. Since each sector is small relative to many files, it would take significant time and memory to document each sector and its location. Because of the inconvenience of tracking so many sectors, the FAT file system introduced clusters: defined groupings of sectors, where each cluster is associated with at most one file. One issue with clusters is that when a file smaller than a cluster is stored, no other file can use the unused sectors in that cluster. In FAT32, the name FAT stands for File Allocation Table, the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster&#039;s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a&lt;br /&gt;
The digits in the name of each FAT variant, as in FAT32, indicate that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are addressable. A problem with larger clusters arises when files are drastically smaller than the cluster size, because there is then a lot of wasted space in each cluster. When a file is accessed, the file system must find all the clusters that make up the file, and this takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data; because of this, some files may have their clusters scattered through the storage device, making access slower. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which reduces the time it takes to access a file. Since reorganization is not a built-in function of FAT32, finding an empty space when storing a file requires a linear search through all the clusters; this is one of the drawbacks of FAT32: it is slow. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table, so that if the file system is interrupted a secondary FAT is available to recover the files.&lt;br /&gt;
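&lt;br /&gt;
The cluster-chain scheme described above (first cluster from the directory entry, each FAT entry naming the next, all F&#039;s marking the end) can be sketched as follows; the constant and names are illustrative, not Microsoft&#039;s actual on-disk layout code:&lt;br /&gt;

```python
# FAT chain-walk sketch: the file allocation table is an array indexed
# by cluster number; each entry gives the next cluster in the file, with
# a special end-of-chain marker (all F's in a real FAT).
END_OF_CHAIN = 0x0FFFFFFF  # illustrative: FAT32 uses 28 of the 32 bits

def clusters_of(fat, first_cluster):
    # Follow the linked list starting at the directory entry's first
    # cluster, collecting every cluster that belongs to the file.
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain
```

Notice that reading a badly fragmented file means chasing this list one entry at a time, which is exactly why fragmentation slows FAT32 down.&lt;br /&gt;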
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the memory space into blocks, which are then grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes. An inode is a structure that contains the description of a file: file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32 the file allocation table defined the organization of file fragments, and it was vital to have duplicate copies of the FAT in case of crashes. Similarly, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are within the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system has had an unclean shutdown and requires the use of &amp;quot;fsck&amp;quot; (the file system checker), which traverses the inodes and directories to repair any inconsistencies. #2.3d&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume size of 2 TB (8 TB with 32 KB clusters, 16 TB with 64 KB clusters), ext2 supports 32 TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes (quite a bit larger). &amp;quot;ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process&amp;quot; #2.3e; because of this, fsck is not used in ZFS, whereas it is in ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically scanning a storage device. The FAT32 and ext2 file systems each manage a single storage device, whereas ZFS incorporates a volume manager that can control multiple storage devices, virtual or physical. Being able to manage multiple storage devices under one file system means that resources are available throughout the system and nothing is unavailable when accessing data from ZFS.&lt;br /&gt;
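&lt;br /&gt;
The capacity figures quoted above follow from simple arithmetic, shown here as a quick check:&lt;br /&gt;

```python
# Quick arithmetic check of the capacity figures in the comparison.
KB = 2 ** 10
TB = 2 ** 40
ZB = 2 ** 70

# FAT32: 28 usable address bits; with 32 KB clusters the table can
# address 2^28 clusters, i.e. 8 TB of data.
fat32_32kb_clusters = (2 ** 28) * 32 * KB
assert fat32_32kb_clusters == 8 * TB

# ZFS: 128-bit block addresses give 2^128 bytes = 2^58 ZB of address
# space, since each zettabyte is 2^70 bytes.
zfs_address_space = 2 ** 128
assert zfs_address_space == (2 ** 58) * ZB
```
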
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. The NTFS file system creates volumes, which are then broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds the metadata for all the files in the volume. The file system data area stores all data that is not included in the Master File Table. Finally, the Master File Table copy is a copy of the MFT.[1] Having a copy of the MFT ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also a part. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, since the entire volume does not have to be scanned to find changes. [2] NTFS also allows compression of files to save disk space, though this can hurt performance: to move compressed files, they must first be decompressed, then transferred, then recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage, but it is capped at a maximum file size of 16 TB and a maximum volume size of 256 TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes, like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume and allow for better performance than ext3 when handling large files; ext3 had very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to disk. To help ensure data integrity, ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block addresses rather than the 32-bit addresses used by ext3; this increases the maximum volume size to 1 EB, up from the 16 TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
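&lt;br /&gt;
The difference between an extent and a per-block map can be made concrete with a small sketch (hypothetical names; ext4&#039;s real extent tree is more elaborate):&lt;br /&gt;

```python
# Extent sketch: one descriptor maps a whole run of contiguous logical
# blocks to contiguous physical blocks, replacing one pointer per block.
# Each extent is a (logical_start, length, physical_start) tuple.
def lookup(extents, logical_block):
    for lstart, length, pstart in extents:
        # In range if lstart is not above the block and the run covers it.
        if logical_block >= lstart and lstart + length > logical_block:
            return pstart + (logical_block - lstart)
    raise KeyError("hole or beyond end of file")
```

A 1 GB contiguous file needs a single descriptor here, versus hundreds of thousands of per-block pointers in an ext3-style block map, which is where the large-file performance win comes from.&lt;br /&gt;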
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size.  NTFS allows for a maximum volume of 256 TB and ext4 allows for 1 EB, while ZFS allows for a maximum file system of 16 EB, 16 times more than the current ext4 Linux file system.  Given the amounts of storage available to the current file systems, it is clear that ZFS is better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, the same goals of efficiency, integrity, and maintainability are clearly visible in the feature set of Btrfs.&lt;br /&gt;
&lt;br /&gt;
Btrfs was designed with efficiency, integrity, and maintainability in mind.  Btrfs uses space efficiently with tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and supports snapshots.  Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure.  In Btrfs, not only are files and directories stored in a B-tree, but so is file system metadata such as extents, transaction logs, data-relocation records, and chunks. A subvolume is a named B-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
The implementation of Btrfs is quite interesting: the developers created it with good coding practices.  First of all, Btrfs has very simple underlying components, which allows most developers to easily understand the basics.  Additionally, the system uses the B-tree structure extensively: in addition to actual data and metadata storage, the structures that maintain and track space allocation use B-trees as well.&lt;br /&gt;
&lt;br /&gt;
Btrfs upon first inspection seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS does have.  Btrfs doesn&#039;t have the self-healing capability or data deduplication of ZFS, and ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming&lt;br /&gt;
&lt;br /&gt;
TO-DO: Plasma Physics Example&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX Conference on File and Storage Technologies. USENIX Association, Berkeley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Heybruck, W. F. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen (2008). Windows Confidential: A Brief and Incomplete History of FAT32. [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F., &amp;amp; Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Mälardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. (2007). The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;br /&gt;
&lt;br /&gt;
* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf &amp;quot;The Btrfs Filesystem&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4553</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4553"/>
		<updated>2010-10-15T05:42:26Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to provide the functionality of both a file system and a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: in particular, data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2].&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Data Set and snapshot Layer).	&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system,&lt;br /&gt;
i.e. via the division of responsibilities across various modules (in this case, seven). Each module provides a specific piece of functionality; as a consequence,&lt;br /&gt;
the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that storage assigned to a particular file system, even when unused, cannot be shared with other file systems.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs for allocating and freeing blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
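To make the malloc()/free() analogy concrete, here is a minimal Python sketch of a pool that hands out opaque virtual addresses while hiding which device a block lives on. The class and method names are ours, not part of the real SPA interface, and the device layout is invented for illustration.

```python
class StoragePool:
    """Toy storage pool: callers get opaque DVAs, never device/block pairs."""

    def __init__(self, devices):
        # devices: mapping of device name to its number of free blocks (assumed layout)
        self.free_blocks = [(dev, blk) for dev, n in devices.items() for blk in range(n)]
        self.allocated = {}
        self.next_dva = 0

    def alloc(self):
        """Allocate one block and return an opaque virtual address (DVA)."""
        dev_blk = self.free_blocks.pop()
        dva = self.next_dva
        self.next_dva += 1
        self.allocated[dva] = dev_blk
        return dva

    def free(self, dva):
        """Return the block behind a DVA to the pool, malloc()/free() style."""
        self.free_blocks.append(self.allocated.pop(dva))


# Two disks contribute to one pool; the caller never sees which disk serves it.
pool = StoragePool({"disk0": 4, "disk1": 4})
a = pool.alloc()
b = pool.alloc()
pool.free(a)
```

As in the text, adding a disk would simply mean appending its blocks to the shared free list; no file system on top needs to change.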
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and removed from the pool dynamically. And since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations; even then, the SPA module can be replaced with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract the underlying device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles&lt;br /&gt;
traditional volume manager tasks, such as mirroring, via vdevs: each vdev implements a specific task, so if the SPA needs to handle mirroring, a vdev&lt;br /&gt;
is written to handle mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object&lt;br /&gt;
in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode. A dnode is a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. Collections of objects, referred to as object sets, are used to describe the file system. In essence then, a ZFS file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU&lt;br /&gt;
objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
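The dnode/object-set layering might be sketched as follows; the class names and fields are ours and heavily simplified, intended only to show how one or more blocks are grouped into an object and objects into a set.

```python
class Dnode:
    """Toy dnode: records which pool blocks back a single object."""

    def __init__(self, object_id):
        self.object_id = object_id   # a 64-bit object number in real ZFS
        self.block_ptrs = []         # addresses of the blocks making up the object


class ObjectSet:
    """Toy object set: a collection of dnodes, describing a file system."""

    def __init__(self):
        self.dnodes = {}

    def create_object(self, object_id):
        dn = Dnode(object_id)
        self.dnodes[object_id] = dn
        return dn


# A "file system" with one object (e.g. a file) spanning two blocks.
fs = ObjectSet()
f = fs.create_object(1)
f.block_ptrs.extend([100, 101])
```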
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining data integrity. This brings us to the topic of how ZFS achieves self-healing; we will tackle ZFS&#039; main strategies in turn.&lt;br /&gt;
&lt;br /&gt;
These can be summed up as checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are applied atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8].&lt;br /&gt;
Each block&#039;s checksum is maintained in its parent&#039;s indirect block. This scheme of keeping the checksum and the data separate reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure is detected from the uberblock downward, since each parent has access to the checksum of the corrupted block. A backup can then be retrieved from another&lt;br /&gt;
location to correct (heal) the affected block. [Z1. P7-9].&lt;br /&gt;
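The detect-and-heal cycle can be sketched in miniature. The CRC32 checksum, the single backup copy, and the one-level "uberblock" here are simplifying assumptions for illustration, not the actual ZFS checksum algorithm or on-disk layout.

```python
import zlib


def checksum(data):
    """Stand-in checksum (real ZFS uses stronger algorithms, e.g. Fletcher/SHA)."""
    return zlib.crc32(data)


# A leaf block and one replica; the "uberblock" keeps the leaf's checksum
# separately from the data itself.
blocks = {"leaf": b"file data", "backup": b"file data"}
uberblock = {"leaf_ptr": "leaf", "leaf_sum": checksum(blocks["leaf"])}

# Simulate silent corruption of the leaf on disk.
blocks["leaf"] = b"file dat!"


def read_healing(ptr):
    """Read a block; on checksum mismatch, fetch the replica and heal in place."""
    data = blocks[ptr]
    if checksum(data) != uberblock["leaf_sum"]:   # corruption detected
        data = blocks["backup"]                   # retrieve a good copy
        blocks[ptr] = data                        # heal the bad block
    return data


recovered = read_healing("leaf")
```

The key property mirrored here is the one from the text: because the checksum lives in the parent rather than next to the data, corrupting the data block alone cannot go unnoticed.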
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it, and the modification is applied there. Any pointers and/or indirect blocks are then updated,&lt;br /&gt;
traversing all the way up to the uberblock. [Z1]&lt;br /&gt;
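A minimal illustration of the copy-on-write rule, assuming a toy one-pointer "tree" rather than ZFS&#039;s real chain of indirect blocks: the live block is never modified in place, and the switch-over is a single pointer update.

```python
# One data block and a single parent pointer standing in for the uberblock.
storage = {0: b"old contents"}
root = {"data_ptr": 0}
next_block = 1


def cow_write(new_data):
    """Copy-on-write: write to a fresh block, then flip the parent pointer."""
    global next_block
    storage[next_block] = new_data   # new data lands in a new location first
    root["data_ptr"] = next_block    # single pointer update "commits" the write
    next_block += 1


cow_write(b"new contents")
# The old block is untouched, so a crash mid-write can never leave the
# reachable tree half-updated.
```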
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are otherwise hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;Ditto Blocks&amp;quot; provide an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
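The ditto-block fallback can be sketched like so; the addresses, the CRC32 checksum, and the three-copy layout are invented for illustration, not ZFS&#039;s actual block-pointer format.

```python
import zlib

# Three identical copies of a metadata block at different addresses.
storage = {10: b"metadata", 20: b"metadata", 30: b"metadata"}

# One block pointer carrying several addresses (the "ditto" copies)
# plus the expected checksum.
block_ptr = {"dvas": [10, 20, 30], "sum": zlib.crc32(b"metadata")}

# The first copy goes bad on disk.
storage[10] = b"corrupt!"


def read(bp):
    """Try each copy in turn; return the first one whose checksum matches."""
    for dva in bp["dvas"]:
        data = storage[dva]
        if zlib.crc32(data) == bp["sum"]:
            return data
    raise IOError("all copies corrupt")


data = read(block_ptr)
```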
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version out of a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or patch sets. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
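The hash-table scheme and the fold factor can be illustrated with a toy in-band deduplicator. The 4-byte block size and the use of SHA-256 digests are assumptions made for the example, not ZFS&#039;s actual parameters.

```python
import hashlib

BLOCK = 4       # unrealistically small block size, chosen for illustration
table = {}      # digest of a block -> the physically stored block
physical = []   # blocks actually written to "disk"
logical = []    # per-file sequence of digests (the logical links)


def ingest(data):
    """In-band dedup: hash each block; store it physically only the first time."""
    for i in range(0, len(data), BLOCK):
        blk = data[i:i + BLOCK]
        digest = hashlib.sha256(blk).digest()
        if digest not in table:
            table[digest] = blk
            physical.append(blk)
        logical.append(digest)


# Three logical blocks, of which only two are unique.
ingest(b"AAAABBBBAAAA")
fold = len(logical) / len(physical)   # fold factor: logical vs. physical storage
```

Shrinking BLOCK would catch more duplicate runs (raising the fold factor) at the cost of a larger table and more hashing, which is exactly the trade-off described above.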
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already-compressed state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way), and a background process analyzes them later to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and when saving files onto these media there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today&#039;s. Because the average user did not store many files on their hard drive, and because small amounts of data were accessed relatively rarely, these file systems did not pay much attention to procedures for preserving data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is made up of sectors (usually 512 bytes). Initially the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the system must record which sectors contain the data of the requested file. Since each sector is small relative to larger files, it would take significant time and memory to document every sector&#039;s owning file and location. Because of the inconvenience of tracking so many sectors, the FAT file system introduced clusters: defined groupings of sectors, each of which is associated with a single file. One issue with clusters is that when a file is smaller than a cluster, no other file can use the unused sectors in that cluster. In FAT32, the name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster’s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.” #2.3a The digits in each FAT variant’s name, as in FAT32, give the width of the file allocation table’s entries: an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. Larger clusters waste space when files are drastically smaller than the cluster size. When a file is accessed, the file system must locate all of the clusters that make it up, which takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data, so over time a file’s clusters may end up scattered across the storage device and take longer to access. FAT32 does not include a defragmentation system, but recent Windows versions ship a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which improves file access times. Since reorganization (defragging) is not built into FAT32, finding empty space when storing a file requires a linear search through the clusters; this is one of FAT32’s drawbacks: it is slow. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always holds two copies of the file allocation table so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
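The cluster-chain walk described in the quoted passage can be sketched as follows; the cluster numbers are invented for illustration, and the single end-of-chain sentinel is a simplification of the "all F&#039;s" marker range used by real FAT32.

```python
EOC = 0x0FFFFFFF                 # simplified end-of-chain marker ("all F's")

# The FAT: entry for cluster N holds the number of the next cluster.
# Here one file occupies clusters 2 -> 3 -> 4.
fat = {2: 3, 3: 4, 4: EOC}


def clusters_of(first):
    """Follow the linked list from the directory entry's first cluster."""
    chain = [first]
    while fat[chain[-1]] != EOC:
        chain.append(fat[chain[-1]])
    return chain


chain = clusters_of(2)
```

A fragmented file would simply have non-consecutive numbers in its chain, which is why reading it requires seeking all over the device.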
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the memory space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). There is a superblock, a block that contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32 the file allocation table defined how file fragments were organized, and it was vital to keep duplicate copies of the FAT in case of crashes. Similarly, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system has had an unclean shutdown and requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies. #2.3d&lt;br /&gt;
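A toy illustration of inode-based lookup, with only direct block pointers and made-up field names (real ext2 inodes also carry indirect pointers and many more metadata fields):

```python
# "Disk" blocks holding file data at made-up block numbers.
blocks = {5: b"hello ", 9: b"world"}

# A simplified inode: metadata plus direct pointers to the data blocks.
inode = {
    "size": 11,                 # file length in bytes
    "mode": "rw-r--r--",        # access rights (illustrative)
    "block_ptrs": [5, 9],       # direct block pointers only, in this sketch
}


def read_file(ino):
    """Reading a file is just following the inode's block pointers in order."""
    return b"".join(blocks[p] for p in ino["block_ptrs"])[: ino["size"]]


content = read_file(inode)
```

Unlike the FAT&#039;s global linked list, all the layout information for a file sits in its own inode, which is what fsck walks when repairing inconsistencies.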
==== Comparison ====&lt;br /&gt;
Comparing how storage devices are managed under different file systems, FAT32 has a maximum volume size of 2 TB (8 TB with 32 KB clusters, 16 TB with 64 KB clusters), ext2 allows 32 TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes: considerably larger. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not used in ZFS, whereas it is in ext2. Not having to check for inconsistencies lets ZFS save time and resources by not systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system, and that nothing is unavailable when accessing data from ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
The New Technology File System (NTFS) was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes, which are then broken down into clusters, much like FAT32. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and an MFT copy. The boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The MFT holds all the metadata for the files in the volume, while the file system data area stores all data not included in the MFT. [1] Keeping a copy of the MFT ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT tracks all file attributes in a relational database, of which the MFT itself is a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. [2] NTFS also allows files to be compressed to save disk space, although this can hurt performance: to move compressed files, they must first be decompressed, then transferred, and then recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage; it is capped at a maximum file size of 16 TB and a maximum volume size of 256 TB. [1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The Fourth Extended File System (ext4) is a Linux file system. Ext4 also uses volumes, like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, descriptors representing runs of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal before making them, in case there is an interruption while writing to disk. To help ensure data integrity, ext4 utilizes checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1 EB, up from ext3&#039;s 16 TB maximum. [4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256 TB and ext4 allows for 1 EB, while ZFS allows for a maximum file system of 16 EB: 16 times more than the current ext4 Linux file system. Given the amounts of storage available to the current file systems, ZFS is clearly better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, was started by Oracle in 2007. It is a file system often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, its main goals of extensibility, reliability, and maintainability are clearly visible in its feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure. In Btrfs, not only are files and directories stored in B-trees, but so are file system data such as extents, transaction logs, data relocation records, and chunks. A subvolume is a named B-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
The implementation of Btrfs is also interesting in that the developers created it with good coding practices. First of all, Btrfs has very simple underlying components, which allows most developers to easily understand the basics. Additionally, the system uses the B-tree structure extensively: beyond actual data and metadata storage, the structures that maintain and track space allocation use B-trees as well.&lt;br /&gt;
&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS does have. Btrfs doesn&#039;t have the self-healing capability or data deduplication of ZFS, and ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Conclusion&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;References&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Heybruck, W. F. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The&lt;br /&gt;
new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4552</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4552"/>
		<updated>2010-10-15T05:41:14Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2]:&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attribute Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved as in any non-trivial software system: by dividing responsibilities across modules (in this case, seven). Because each module provides one specific piece of functionality, the system as a whole is simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it does not abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system it cannot be shared with other file systems, even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
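&lt;br /&gt;
As an illustration only, the SPA&#039;s malloc()/free()-style interface can be sketched in Python (the class and method names here are invented for this sketch, not the real SPA API):&lt;br /&gt;

```python
import itertools

class StoragePool:
    """Toy model of the SPA block-allocation interface (illustrative only)."""

    def __init__(self):
        self._next_dva = itertools.count(1)  # data virtual addresses
        self._blocks = {}                    # dva to block contents

    def alloc(self, size):
        """Like malloc(), but returns a DVA instead of a pointer."""
        dva = next(self._next_dva)
        self._blocks[dva] = bytearray(size)
        return dva

    def free(self, dva):
        """Like free(); the caller never learns which device held the block."""
        del self._blocks[dva]

pool = StoragePool()
dva = pool.alloc(4096)   # the caller sees only a virtual address
pool.free(dva)
```

All device-level details stay hidden behind the DVA, mirroring the malloc()/free() analogy above.&lt;br /&gt;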
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations; even then, the SPA module can be replaced with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles traditional volume-manager tasks, such as mirroring, via vdevs: each vdev implements one specific task, so if the SPA needs to handle mirroring, a vdev is written for it. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode, a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is in turn used to describe the file system. In essence, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is a collection of blocks. Such levels of abstraction increase ZFS&#039;s flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
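&lt;br /&gt;
The block/object/object-set layering can be sketched roughly as follows (Dnode and ObjectSet are illustrative stand-ins written for this essay, not ZFS&#039;s real structures):&lt;br /&gt;

```python
class Dnode:
    """Per-object metadata: records which blocks make up one object."""

    def __init__(self, object_id, block_dvas):
        self.object_id = object_id    # 64-bit object number
        self.block_dvas = block_dvas  # DVAs of the blocks holding the data

class ObjectSet:
    """A file system is described by object sets, each a collection of objects."""

    def __init__(self):
        self.objects = {}

    def add(self, dnode):
        self.objects[dnode.object_id] = dnode

# A file (object 42) backed by three blocks:
fs = ObjectSet()
fs.add(Dnode(object_id=42, block_dvas=[7, 8, 9]))
```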
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining data integrity. This brings us to how ZFS achieves self-healing.&lt;br /&gt;
&lt;br /&gt;
Its main strategies can be summed up as checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL combines data-changing writes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are applied atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves; the root node, in ZFS&#039;s terminology, is called the uberblock [Z1. P8]. Each block&#039;s checksum is maintained in its parent&#039;s indirect block. This scheme of keeping the checksum and the data separate reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure can be detected, since the checksum of the corrupted block is stored separately in its parent; a backup can then be retrieved from another location to correct (heal) the affected block. [Z1. P7-9].&lt;br /&gt;
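&lt;br /&gt;
The verify-then-heal idea can be sketched in Python as follows (the function names and the use of SHA-256 are assumptions made for this sketch, not ZFS&#039;s actual implementation):&lt;br /&gt;

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def read_block(data, parent_sum, replicas):
    """Verify data against the checksum kept in the parent block; on a
    mismatch, fall back to a replica stored at another location."""
    if checksum(data) == parent_sum:
        return data
    for replica in replicas:               # try the backup copies
        if checksum(replica) == parent_sum:
            return replica                 # self-heal: healthy copy found
    raise IOError("no healthy copy of the block exists")

good = b"file contents"
parent_sum = checksum(good)                # stored separately from the data
assert read_block(b"fiXe contents", parent_sum, [good]) == good
```

Because the checksum lives in the parent, a corrupted block cannot vouch for itself, which is what makes the mismatch detectable.&lt;br /&gt;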
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, the new contents are written to a newly allocated block; the old block is never overwritten in place. The pointers and indirect blocks that reference it are then updated in the same way, propagating all the way up to the uberblock. [Z1]&lt;br /&gt;
&lt;br /&gt;
The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.&lt;br /&gt;
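&lt;br /&gt;
A toy model of a copy-on-write update, with invented names, might look like this:&lt;br /&gt;

```python
import itertools

_next_dva = itertools.count(100)
pool = {}                     # dva to block contents, never overwritten in place

def write_block(contents):
    dva = next(_next_dva)
    pool[dva] = contents
    return dva

def cow_update(levels_above, new_data):
    """Write new_data to a fresh block, then rewrite each ancestor pointer
    the same way, finishing with a new uberblock (illustrative sketch)."""
    child = write_block(new_data)
    for _level in levels_above:       # indirect blocks, then the uberblock
        child = write_block({"points_to": child})
    return child                      # DVA of the new uberblock

old = write_block(b"old contents")
new_root = cow_update(["indirect block", "uberblock"], b"new contents")
assert pool[old] == b"old contents"   # the old copy was never touched
```

Until the final switch to the new uberblock, readers still see the old tree, which is what keeps the update atomic.&lt;br /&gt;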
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;Ditto Blocks&amp;quot;, provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are stored only for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS can follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file system snapshots: essentially, a clone of the entire file system at a certain point in time. In the event of accidental file deletion, a user can retrieve an older version of a file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression based on the idea of physically storing any one block of unique data only once, and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only for data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file units (blocks), or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time for deduplication and reconstruction.&lt;br /&gt;
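&lt;br /&gt;
The hash-table idea can be sketched with a simplified, illustrative model (not ZFS&#039;s actual deduplication code):&lt;br /&gt;

```python
import hashlib

class DedupStore:
    """Block-level deduplication: a hash table maps a block fingerprint to
    the single physical copy; files keep only lists of fingerprints."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.table = {}               # fingerprint to physical block

    def ingest(self, data):
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            self.table.setdefault(key, block)   # store each unique block once
            refs.append(key)
        return refs                   # the logical file is a list of fingerprints

    def reconstruct(self, refs):
        return b"".join(self.table[k] for k in refs)

store = DedupStore(block_size=4)
refs = store.ingest(b"abcdabcdxyz!")  # the 4-byte block "abcd" appears twice
```

With 4-byte blocks the repeated block is stored once, so the fold factor here is three logical blocks to two physical ones.&lt;br /&gt;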
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must hold the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes and compresses them later. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage devices such as hard disks and flash memory, and saving files onto these devices requires an abstraction that organizes how files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today&#039;s. The average user would not have many files stored on their hard drive, and because such small amounts of data might not be accessed often, these file systems did not devote much attention to procedures for preserving data integrity (repairing the file system and relocating files).&lt;br /&gt;
====FAT32====&lt;br /&gt;
The memory of a storage device is made up of sectors (usually 512 bytes). Initially, sectors were to contain file data directly, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors contain the data of each file; since a sector is small relative to many files, documenting every sector individually would cost significant time and memory. To avoid this inconvenience, the FAT file system groups sectors into clusters, with each cluster belonging to at most one file. A drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file.&lt;br /&gt;
&lt;br /&gt;
In FAT32, the name FAT stands for File Allocation Table, the table that holds an entry for each cluster on the storage device and its properties. The FAT is designed as a linked-list data structure in which each node holds one cluster&#039;s information: &amp;quot;For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.&amp;quot; #2.3a The digits in each FAT system&#039;s name, as in FAT32, indicate that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available.&lt;br /&gt;
&lt;br /&gt;
Larger clusters waste more space when files are drastically smaller than the cluster size. When a file is accessed, the file system must find all of the clusters that make up the file, which takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data, so over time a file&#039;s clusters may become scattered across the storage device, slowing access. FAT32 does not include a defragmentation system, but recent Windows operating systems ship with a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which improves file access time. Because reorganization is not a built-in function of FAT32, finding empty space when storing a file requires a linear search through all the clusters; this is one of FAT32&#039;s drawbacks: it is slow. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and the system always keeps two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
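&lt;br /&gt;
The cluster chain described in the quotation above can be modelled with a toy FAT (values illustrative):&lt;br /&gt;

```python
# Toy FAT: entry i holds the number of the next cluster in the file,
# with a sentinel marking the last cluster (all F bits in real FAT32).
EOF = 0xFFFFFFF

fat = {2: 3, 3: 4, 4: EOF}      # a file occupying clusters 2, 3, 4

def cluster_chain(fat, first_cluster):
    """Follow the linked list from the directory entry to the last cluster."""
    chain = []
    cluster = first_cluster
    while cluster != EOF:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

assert cluster_chain(fat, 2) == [2, 3, 4]
```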
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group; it also contains the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes, structures that contain the description of a file: file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file&#039;s data. Just as FAT32 keeps duplicate copies of the file allocation table in case of crashes, ext2 keeps backups of its critical structures: the first block in ext2 is the superblock, which also contains the list of group descriptors (each block group has a group descriptor mapping out where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used after an unclean shutdown, when &amp;quot;fsck&amp;quot; (the file system checker) traverses the inodes and directories to repair any inconsistencies. #2.3d&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 supports 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. &amp;quot;ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process&amp;quot; #2.3e; because of this, fsck is not needed in ZFS, whereas it is in ext2. Not having to check for inconsistencies lets ZFS save time and resources by not systematically scanning a storage device. FAT32 manages a single storage device, as does ext2, whereas ZFS subsumes the role of a volume manager and can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and nothing becomes unavailable when accessing data from ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
New Technology File System (NTFS) was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes, which are broken down into clusters much like in FAT32. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a copy of the Master File Table. The boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds the metadata for all files in the volume, the file system data area stores data not included in the MFT, and the MFT copy is a duplicate of the MFT. [1] Having the copy of the MFT ensures that the file system can still be recovered if there is an error with the primary MFT. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is a part, and every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: changes are entered into the journal before they are made, in case of an interruption while they are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. [2] NTFS also allows compression of files to save disk space, though this can hurt performance, because moving compressed files requires decompressing, transferring, and recompressing them. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system, allowing for 2^64 bytes of storage, but it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
Fourth Extended File System (ext4) is a Linux file system. Like NTFS, ext4 uses volumes, but it does not use clusters; it was designed to allow for greater scalability than ext3. Ext4 uses extents, descriptors representing ranges of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance than ext3 when handling large files, where ext3 had very large overhead. Ext4 is also a journaling file system: it records changes in a journal before making them, in case of an interruption while writing to disk. To help ensure data integrity, ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB from the 16TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference between ZFS and other current file systems is size. NTFS allows a maximum volume of 256TB and ext4 allows 1EB, while ZFS allows a maximum file system size of 16EB, sixteen times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, ZFS is clearly better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree file system started by Oracle in 2007, is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, Btrfs shares the goals of extensibility, reliability and maintainability, which are clearly visible in its feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure. In Btrfs, not only are files and directories stored in B-trees, but so is file system data such as extents, transaction logs, data relocation and chunks. A subvolume is a named B-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The implementation of Btrfs is quite interesting: the developers created it with good coding practices. First of all, Btrfs has very simple underlying components, which allows most developers to easily understand the basics. Additionally, the system uses the B-tree structure extensively: in addition to actual data and metadata storage, the structures that maintain and track space allocation use B-trees as well.&lt;br /&gt;
&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS does have. Btrfs does not have the self-healing capability or data deduplication of ZFS, and ZFS also supports more software RAID configurations than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
====Comparison====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The&lt;br /&gt;
new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4550</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4550"/>
		<updated>2010-10-15T05:39:34Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is doing what */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Contacts / If interested ==&lt;br /&gt;
Tawfic : tfatah@gmail.com&lt;br /&gt;
&lt;br /&gt;
Andy Zemancik: andy.zemancik@gmail.com&lt;br /&gt;
&lt;br /&gt;
Lester Mundt: lmundt@gmail.com&lt;br /&gt;
&lt;br /&gt;
Matthew Chou : mateh.cc@gmail.com (this is mchou2)&lt;br /&gt;
&lt;br /&gt;
Nisrin Abou-Seido: naseido@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
== Suggested References Format ==&lt;br /&gt;
Author, publisher/university, Name of the article&lt;br /&gt;
&lt;br /&gt;
== Who is doing what ==&lt;br /&gt;
Suggestion: In order to avoid duplication. Please state what section/item you&#039;re currently working on.&lt;br /&gt;
&lt;br /&gt;
Azemanci: Currently working on Section Three Current File Systems.&lt;br /&gt;
&lt;br /&gt;
Nisrin (naseido): Working on intro, conclusion and editing&lt;br /&gt;
&lt;br /&gt;
Lester: Working on section 3 BTRFS and WinFS&lt;br /&gt;
&lt;br /&gt;
Tawfic: Looking into the conclusion.&lt;br /&gt;
&lt;br /&gt;
KEY ISSUE: We need a thesis statement. Please suggest ideas.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 01:38, 15 October 2010 (UTC) I think the thesis statement is implied in the current intro, i.e. &amp;quot;.. of avoiding some of the major problems associated with traditional file systems . . &amp;quot; suggests that it&#039;s no longer acceptable to tolerate problems in today&#039;s IT environment. If you&#039;d like to expand on it, you could mention the continuous need for flexibility vis-a-vis using data. Example: there&#039;s a growing trend toward cloud computing (the marketable name for distributed computing). Users who opt for that option will have to trust the host companies with their data; it won&#039;t be acceptable to them to be told that a file here or a directory there was lost. The growth of smart phones and yet-to-proliferate tablets also increases the demand for flexibility (the need to be able to add/shrink/manage storage on the fly and avoid manual intervention when problems arise), etc. Hope that helps.&lt;br /&gt;
&lt;br /&gt;
The conclusion would have to assert the main points regarding ZFS, i.e. its modularization and administrative simplicity, its&lt;br /&gt;
virtualization of storage via the use of the SPA and DMU, and its self-healing abilities. I am currently working on the section&lt;br /&gt;
that talks (briefly) about ZFS&#039;s self-healing (for lack of better words). So I&#039;ll be online for some time. The info on the SPA&lt;br /&gt;
and DMU is already there. If it&#039;s not clear enough, please let me know.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 03:05, 15 October 2010 (UTC) I agree completely with the thesis and mentioned this at the bottom of the page.  &lt;br /&gt;
&lt;br /&gt;
Something along the lines of &amp;quot;ZFS is a file system designed to support the changing requirements of computing&amp;quot;, as I had mentioned before (&amp;quot;server needs was of particular attention&amp;quot;), then continue to expand by describing the environment that is causing these changes (&amp;quot;more companies with distributed large-scale data storage, the cloud&amp;quot;), as Tawfic has suggested; this establishes motivation.&lt;br /&gt;
&lt;br /&gt;
Then we mention the goals that were desired: extensibility (accomplished through modularization), reliability (checksums, copy-on-write) and maintainability (administrative simplicity).&lt;br /&gt;
&lt;br /&gt;
In the intro to ZFS we talk about how its feature set supports the design goals.  Tawfic has then done most of the feature descriptions.&lt;br /&gt;
&lt;br /&gt;
The section on legacy file systems should have a mini-intro talking about the state of the environment they were designed for and their goals at the time.&lt;br /&gt;
&lt;br /&gt;
Then describe them, which has already been done.&lt;br /&gt;
&lt;br /&gt;
A small compare-and-contrast with ZFS, generally summarizing along the lines of &amp;quot;gee, that sure is better&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Repeat for the other two sections, current and future, with each contrast/compare getting larger since they are more comparable, and with different conclusions.&lt;br /&gt;
&lt;br /&gt;
== Deadline ==&lt;br /&gt;
Suggestion: Adding content should stop on Thursday, October 14th at 3:00 PM. Any work after that&lt;br /&gt;
should go into formatting, spelling, and grammar checking.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)&lt;br /&gt;
- I will definitely be adding content after this time probably late, late into the evening.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we&#039;d have to be done by 11:00 PM.&lt;br /&gt;
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 AM or so. So anytime&lt;br /&gt;
before that is OK.&lt;br /&gt;
&lt;br /&gt;
== Essay Format Take 2 ==&lt;br /&gt;
Hello. I am suggesting the following format instead. If you agree, I&#039;ll take care of merging the existing info into this new format. My feeling is that this format is&lt;br /&gt;
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Abstract&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: Main point. Current file systems are neither versatile nor intelligent enough to handle the rapidly&lt;br /&gt;
 growing needs of dynamic storage.&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer&lt;br /&gt;
 demand . . etc )&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more&lt;br /&gt;
 intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Limitations&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Zettabyte File System&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Dissected&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: List the seven components of ZFS and basically what makes up a ZFS,&lt;br /&gt;
 e.g. interface, various parts, and needed external libraries, etc.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Features Beyond Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Possible Real-Life Scenarios / Examples&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: 2-3 examples where ZFS was/could/is being considered for use.&lt;br /&gt;
&lt;br /&gt;
 TO-DO: One to two paragraphs stressing / reiterating the main points made in the abstract&lt;br /&gt;
         (thesis statement).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Alternatives to ZFS&#039;&#039;&#039;&lt;br /&gt;
 one example is good enough.&lt;br /&gt;
 TO-DO: a brief description of the alternative.&lt;br /&gt;
 Main argument for its viability.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Pros/Cons&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: just a list of pluses and minuses&lt;br /&gt;
&lt;br /&gt;
 TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional&lt;br /&gt;
 FS’s are no longer viable, and  stressing once more that ZFS is a valid alternative.&lt;br /&gt;
&lt;br /&gt;
== Essay Format ==&lt;br /&gt;
&lt;br /&gt;
I started working on the main page.  The bullets are to be expanded. Other groups are working on their respective discussion pages, but I think it&#039;s all right to put our work in progress on the front page.  Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)&lt;br /&gt;
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt;  what do you think of listing the capacities of the file system under major features?  I was thinking that we could overview the features in brief, then delve into each one individually.&lt;br /&gt;
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what you&#039;re suggesting in one section. So here is the structure I am thinking of.&lt;br /&gt;
&lt;br /&gt;
* Intro &lt;br /&gt;
* Section One ZFS&lt;br /&gt;
** Major feature 1&lt;br /&gt;
** Major feature 2&lt;br /&gt;
** Major feature 3 &lt;br /&gt;
* Section Two Legacy File Systems&lt;br /&gt;
** Legacy File System1( FAT32 ) - what it does&lt;br /&gt;
** Legacy File System2( ext2 ) - what it does&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Three Current File Systems&lt;br /&gt;
** NTFS?&lt;br /&gt;
** ext4?&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Four future file Systems&lt;br /&gt;
** BTRFS&lt;br /&gt;
** WinFS or ??&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Conclusion&lt;br /&gt;
&lt;br /&gt;
What does everyone think of this format?   While everyone should contribute to section one we could divvy up the rest.&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming its own essay just on deduplication.&lt;br /&gt;
&lt;br /&gt;
The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.&lt;br /&gt;
&lt;br /&gt;
The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than a full end-to-end solution.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC)  The section on deduplication looks good; I agree it looks like the right amount of coverage for a portion of a single section.  You&#039;re also right about the old file systems not being able to hold a candle to ZFS, and the conclusion section should talk about how storage needs and computers changed.  An intro to that section could set the stage for that period as well: non-multi-threaded, single-processor systems with much smaller RAM. Even the applications were radically different, and the Internet was just simple webpages without the high-performance needs of web commerce and online banking, for example.  I have another assignment, so I won&#039;t be contributing too much until Monday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)&lt;br /&gt;
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)&lt;br /&gt;
I think I prefer the outline I proposed, only because it&#039;s a very regimented compare/contrast essay format and should get us the marks for format.  Proper essays don&#039;t usually have a dedicated pros/cons list; that heads more towards a report format, I think.  It&#039;s really whatever everyone agrees on.  I won&#039;t be touching the essay until tomorrow though.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)&lt;br /&gt;
I like Lmundt&#039;s outline.  How would you like to divide up the work?  Also can everyone post the contact information so we know exactly who is in our group.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)&lt;br /&gt;
No problem, I&#039;ll go with the current format. One issue to keep in mind is that this is an essay, not a report, i.e. the intro/thesis has to include&lt;br /&gt;
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that&lt;br /&gt;
if we keep these points in mind. I started looking into the &amp;quot;dissect subsection&amp;quot; in the format I suggested, which is related to the ZFS features&lt;br /&gt;
section one in the current format. I&#039;ll continue to look into that part (the &amp;quot;who is doing what&amp;quot; section above will be updated accordingly), i.e. I&#039;ll&lt;br /&gt;
take care of section one since I&#039;ve already done some work on it. I suggest that each member of the group picks two items from one of the other&lt;br /&gt;
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract&lt;br /&gt;
and conclusion sections can be left to the end, and can be done collaboratively, i.e. once we have a very clear picture of all the&lt;br /&gt;
different pieces.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)&lt;br /&gt;
I will begin working on section three current File Systems unless someone else has already begun working on it.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)&lt;br /&gt;
I am going to start researching for section 2.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC)  Alright so all the sections are being taken care of so we should be good to go for Thursday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) &#039;&#039;&#039;No one is assigned to section four&#039;&#039;&#039; ? Also, for those who haven&#039;t picked any section or subsection, please help out with the sections you&#039;re&lt;br /&gt;
more familiar with.&lt;br /&gt;
&lt;br /&gt;
Finally, if you were in class today (well, technically yesterday), then you&#039;ve heard Anil talk about plagiarism. I know this is common knowledge, so forgive&lt;br /&gt;
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is&lt;br /&gt;
simply impossible for the rest of the group to check whether every member&#039;s sentence is genuine or not. So use your own words/phrases ( doesn&#039;t&lt;br /&gt;
have to be fancy or sophisticated ). If you&#039;re not sure, please check with the rest of the group.&lt;br /&gt;
&lt;br /&gt;
Good luck, and good night.&lt;br /&gt;
--Tawfic&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC)  My bad, I misread something; I thought you were doing section 3, Current File Systems.  I&#039;ll take section 3, but then someone needs to do section 4.  There are four of us, so this should not be a problem.&lt;br /&gt;
&lt;br /&gt;
--[[User:Naseido|Naseido]] 13 October 2010  Sorry I haven&#039;t contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I&#039;ll be doing section 4.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)&lt;br /&gt;
I have done what I can for the legacy file systems; if someone who doesn&#039;t have any particular job wouldn&#039;t mind going over it and correcting any errors they see, that would help. I am also not familiar with how to edit/format these wiki pages, so I tried my best; if you want to change the layout then please do. I would assume that after we complete our sections and collaborate them into one essay, the formatting will change. I simply put headings on each section just so it is easier to read.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC)  I&#039;m not going to have my info posted by 3:00.  Also how and where are we supposed to cite our sources?&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have until 7:00 AM (or until Anil locks the Wiki down, though I wouldn&#039;t count on more than 7) Friday, Oct 15. For citing, I am using&lt;br /&gt;
this convention: Bla.....Bla [Z1. P3] means I am using info from page 3 of the article labeled as Z1 in the references section.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 20:48, 14 October 2010 (UTC) The citation reference looks good.  I think in each section we should talk about the motivation behind the design, with a small intro for each section.  For example, in the legacy area (FAT32 and ext), we could talk about computing becoming commonplace for the average worker, so most development was for single users with a relatively small hard drive and not too many files.  That will in turn justify/explain the design decisions for each vintage of OS. &lt;br /&gt;
The main intro should be about the motivations behind the design: intended for servers, with a focus on expandability, reliability and self-maintenance.  This is the motivation behind all those cool features we are detailing.&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
Not from your group. Found a file which goes to the heart of your problem:&lt;br /&gt;
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-149902.pdf ZFSDatasheet]&lt;br /&gt;
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:  &lt;br /&gt;
* PDF:  http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf&lt;br /&gt;
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&amp;amp;hl=en&amp;amp;as_sdt=2000&lt;br /&gt;
&lt;br /&gt;
Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey, I&#039;m not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey there. I&#039;m not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations&lt;br /&gt;
&lt;br /&gt;
-- Munther&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS&lt;br /&gt;
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)&lt;br /&gt;
A bit late, but I found a comparison of current file systems including ZFS:&lt;br /&gt;
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf&lt;br /&gt;
&lt;br /&gt;
http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4545</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4545"/>
		<updated>2010-10-15T05:32:44Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to handle the functionality required of a file system as well as that of a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it addresses data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2]:&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attribute Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system,&lt;br /&gt;
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality; as a consequence,&lt;br /&gt;
the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems, even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(): instead of memory, though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
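As a concrete illustration of the malloc()/free() analogy, here is a minimal Python sketch of a pool-style allocator. The names (StoragePool, alloc, free) and the DVA counter are hypothetical, not actual ZFS interfaces; the point is only that callers receive opaque virtual addresses and never see the physical layout.

```python
# Toy sketch of an SPA-like interface: callers allocate and free storage
# blocks by virtual address (DVA) without knowing which disk backs them.
# All names here are illustrative, not real ZFS APIs.

class StoragePool:
    def __init__(self, devices):
        # devices: mapping of device name to capacity in blocks
        self.free_blocks = [(dev, blk) for dev, cap in devices.items()
                            for blk in range(cap)]
        self.allocated = {}   # DVA to (device, physical block)
        self.next_dva = 0

    def alloc(self):
        """Like malloc(): hand back an opaque DVA, hiding physical layout."""
        dev, blk = self.free_blocks.pop()
        dva = self.next_dva
        self.next_dva += 1
        self.allocated[dva] = (dev, blk)
        return dva

    def free(self, dva):
        """Like free(): return the physical block behind a DVA to the pool."""
        self.free_blocks.append(self.allocated.pop(dva))

pool = StoragePool({"disk0": 4, "disk1": 4})
a = pool.alloc()
b = pool.alloc()
pool.free(a)
```

Adding a device to the pool would simply append its blocks to the free list; no caller holding a DVA needs to change, which is the dynamic add/remove property described above.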
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS left intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device (i.e. a vdev) or a device driver. The SPA also handles&lt;br /&gt;
traditional volume manager tasks, like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task: in this case, if the SPA needed to handle mirroring, a vdev&lt;br /&gt;
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.  &lt;br /&gt;
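The node-with-children idea can be sketched as a toy vdev tree in Python. These classes are illustrative only, not ZFS code: a mirror vdev duplicates writes to all children and reads from the first child that still has the data.

```python
# Illustrative vdev tree: a mirror vdev duplicates writes to all of its
# children and falls back across children on read.

class DiskVdev:
    """Leaf vdev standing in for a device driver."""
    def __init__(self):
        self.blocks = {}
    def write(self, addr, data):
        self.blocks[addr] = data
    def read(self, addr):
        return self.blocks.get(addr)

class MirrorVdev:
    """Interior vdev node whose children may be disks or other vdevs."""
    def __init__(self, children):
        self.children = children
    def write(self, addr, data):
        for child in self.children:      # duplicate to every child
            child.write(addr, data)
    def read(self, addr):
        for child in self.children:      # fall back if one copy is missing
            data = child.read(addr)
            if data is not None:
                return data
        return None

mirror = MirrorVdev([DiskVdev(), DiskVdev()])
mirror.write(0, b"hello")
mirror.children[0].blocks.clear()        # simulate losing one side
recovered = mirror.read(0)               # still served by the other child
```

Because both classes expose the same read/write interface, new functionality (striping, RAID-Z-style parity) would just be another node type, which is the extensibility point made above.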
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object&lt;br /&gt;
in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode. A dnode is a data structure that stores information about the blocks belonging to an object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant layer to manage DMU&lt;br /&gt;
objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. We will tackle the main strategies followed by ZFS.&lt;br /&gt;
&lt;br /&gt;
These can be summed up as: checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions, the ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1. P9]. Consistency is thus assured, since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8].&lt;br /&gt;
Each block has a checksum, which is maintained in the block&#039;s parent indirect block. This scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure can be detected from the tree rooted at the uberblock, since each parent holds the checksum of the corrupted block. A backup copy can then be retrieved from another&lt;br /&gt;
location to correct (heal) the affected block. [Z1. P7-9].  &lt;br /&gt;
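The detect-and-repair cycle can be sketched as follows. This is illustrative code, not ZFS internals: a CRC stands in for the stronger checksums ZFS uses, and a single replica stands in for ditto blocks or mirror copies.

```python
# Sketch of self-healing reads: a parent holds each child block's
# checksum, so corruption is detected on read and repaired from a replica.
import zlib

blocks = {"child": b"important data"}
replicas = {"child": b"important data"}            # backup copy elsewhere
parent = {"child": zlib.crc32(blocks["child"])}    # checksum kept in parent

def read_and_heal(name):
    data = blocks[name]
    if zlib.crc32(data) != parent[name]:           # mismatch: block is bad
        data = replicas[name]                      # fetch the healthy copy
        blocks[name] = data                        # heal the bad block
    return data

blocks["child"] = b"corrupted!!!!"                 # simulate silent corruption
healed = read_and_heal("child")
```

Note that keeping the checksum in the parent, rather than next to the data, is what makes silent corruption of a block detectable: the block cannot corrupt its own verifier.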
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, a new block is allocated, the old block&#039;s contents are copied into it, and the changes are applied to the new block; the old block is never overwritten in place. Any pointers and/or indirect blocks are then updated,&lt;br /&gt;
propagating all the way up to the uberblock. [Z1]&lt;br /&gt;
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents major problems, such as silent data corruption, that are hard to detect.&lt;br /&gt;
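The propagation of changes up to the uberblock can be sketched with a small copy-on-write tree update. This is a hypothetical illustration: modifying a leaf allocates new copies of the blocks along the path to the root, while the old tree remains intact until the new root is adopted.

```python
# Minimal copy-on-write sketch: an update returns a brand-new root whose
# path to the changed leaf is freshly copied; unchanged subtrees are shared.

def cow_update(tree, path, new_value):
    """Return a new tree; the old tree is never modified in place."""
    if not path:
        return new_value
    head, rest = path[0], path[1:]
    new_node = dict(tree)                  # fresh copy of this block
    new_node[head] = cow_update(tree[head], rest, new_value)
    return new_node

old_uberblock = {"dir": {"file_a": "v1", "file_b": "v1"}}
new_uberblock = cow_update(old_uberblock, ["dir", "file_a"], "v2")
```

Since the old root stays valid until the single atomic step of adopting the new root, a crash mid-update leaves a consistent tree, which is why no journal replay is needed.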
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;Ditto Blocks&amp;quot;, provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set.  There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it.  In general, as you consider smaller blocks of data for deduplication, you increase your &amp;quot;fold factor&amp;quot;, that is, the difference between the logical storage provided vs. the physical storage needed.  At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state.  While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested.  In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way), and a background process analyzes them at a later time to perform the compression.  This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
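A rough sketch of block-level, in-band deduplication with a hash table (hypothetical code, not the actual ZFS dedup table): each incoming block is hashed on arrival, stored physically only once, and files become lists of block references.

```python
# Toy in-band, block-level dedup: hash each fixed-size block on ingest
# and store only blocks never seen before.
import hashlib

BLOCK = 4
store = {}          # content hash to physical block, stored once

def ingest(data):
    """Analyze data as it arrives; write only previously unseen blocks."""
    refs = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        h = hashlib.sha256(block).hexdigest()
        store.setdefault(h, block)    # physical write only if unseen
        refs.append(h)
    return refs                       # a file is a list of block references

def reconstruct(refs):
    """Rebuild a file transparently from its block references."""
    return b"".join(store[h] for h in refs)

f1 = ingest(b"AAAABBBBAAAA")          # the AAAA block is stored only once
f2 = ingest(b"AAAACCCC")              # AAAA is shared across both files
```

Here 20 logical bytes are backed by 12 physical bytes (three unique blocks), a small example of the fold factor; shrinking BLOCK would raise the fold factor at the cost of a larger hash table, the trade-off noted above.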
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage devices such as hard disks and flash memory, and when saving files onto these devices there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than those of today. The average worker/user would not have many files stored on their hard drive, and because of the small amounts of data, which might not be accessed as often as thought, these file systems did not worry too much about procedures to repair data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is divided into sectors (usually 512 bytes). Initially the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the system must record which sectors contain the data of that file. Since each sector is small relative to many of the files in use, it would take significant time and memory to document every sector along with the file it belongs to and its location. Because tracking so many individual sectors is impractical, the FAT file system introduced clusters: defined groupings of sectors, each of which is associated with at most one file. One drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file. In FAT32, the name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked-list data structure in which each node holds one cluster&#039;s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. 
Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits in the name of a FAT variant, as in FAT32, indicate that the file allocation table is an array of 32-bit values.#2.3b Of those 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. Larger clusters cause problems when files are drastically smaller than the cluster size, because much of the cluster becomes wasted space. When a file is accessed, the file system must locate all the clusters that together make up the file, which takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data, so over time a file&#039;s clusters may end up scattered across the storage device, making the file slower to access. FAT32 does not include a defragmentation mechanism of its own, but recent Windows operating systems ship with a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which reduces the time it takes to access the file. Since reorganization is not a built-in function of FAT32, storing a new file requires a linear search through the clusters for empty space; this is one of the drawbacks of FAT32: it is slow. The first cluster of every FAT32 volume contains information about the operating system and root directory, and the volume always holds two copies of the file allocation table, so that if the file system is interrupted a secondary FAT is available to recover the files.&lt;br /&gt;
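The cluster chain described in the quoted passage behaves like a linked list threaded through the table itself. A minimal Python sketch, using a made-up 8-entry FAT, shows how the clusters of a file are recovered by following entries until the end-of-chain marker:&lt;br /&gt;

```python
# Hypothetical tiny FAT. 0x0FFFFFFF is the end-of-chain marker (the
# "all F values" entry, masked to the 28 bits FAT32 uses for cluster
# numbers). Entries 0 and 1 are reserved, as on a real volume.
EOC = 0x0FFFFFFF
fat = [0, 0, 3, 4, EOC, 0, 7, EOC]  # file A: 2 -> 3 -> 4, file B: 6 -> 7

def clusters_of(first_cluster):
    """Follow the linked list of FAT entries belonging to one file."""
    chain = []
    c = first_cluster
    while True:
        chain.append(c)
        nxt = fat[c]
        if nxt == EOC:
            return chain
        c = nxt
```

The device directory supplies only `first_cluster`; everything after that comes from walking the table, which is why a fragmented chain costs extra seeks.&lt;br /&gt;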
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and mimics certain UFS features while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group; it also records the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes: structures that contain the description of the file, its type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32 the file allocation table defined how file fragments were organized, and duplicate copies of the FAT were vital in case of crashes. Similarly, in ext2 the first block is the superblock, followed by the list of group descriptors (each block group has a group descriptor mapping out where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system has had an unclean shutdown and must run “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
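The inode layout can be illustrated with a simplified model. A real ext2 inode holds 15 block pointers: 12 direct, plus one single-, one double- and one triple-indirect pointer. The sketch below (field names simplified, not the on-disk layout) also shows the file-size limit in blocks that this scheme implies:&lt;br /&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Simplified ext2 inode: the file description lives here, not the name."""
    mode: int            # file type and access rights
    uid: int             # owner
    size: int            # length in bytes
    mtime: int           # modification timestamp
    block: list = field(default_factory=lambda: [0] * 15)
    # block[0..11] are direct pointers to data blocks;
    # block[12] is single indirect, block[13] double, block[14] triple.

def max_file_blocks(ptrs_per_block):
    """Data blocks addressable by one inode, given pointers per indirect block."""
    return 12 + ptrs_per_block + ptrs_per_block**2 + ptrs_per_block**3

# With 1 KiB blocks and 4-byte pointers, one indirect block holds 256 pointers.
limit = max_file_blocks(256)
```

The directory entry, by contrast, stores only a name and an inode number, which is why several names can refer to the same file.&lt;br /&gt;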
==== Comparison ====&lt;br /&gt;
Comparing how these file systems manage storage devices, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 supports up to 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes (quite larger). “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not used in ZFS, whereas it is in the ext2 file system. Not having to check for inconsistencies saves ZFS the time and resources of systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS includes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system, and nothing becomes unavailable when accessing data from ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
The New Technology File System (NTFS) was first introduced with Windows NT and is used on all modern Microsoft operating systems. NTFS creates volumes, which are broken down into clusters much like in FAT32. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The MFT holds all the metadata about every file in the volume; the file system data stores all data not included in the MFT; and the MFT copy is a duplicate of the MFT.[1] Keeping a copy of the MFT ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT tracks all file attributes in a relational database, of which the MFT itself is also a part, and every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance because the entire volume does not have to be scanned to find changes.[2] NTFS also allows files to be compressed to save disk space, though this can hurt performance: to move compressed files, they must first be decompressed, then transferred, then recompressed.[3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage, but it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
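The write-ahead idea behind journaling can be sketched abstractly: record the intended change first, apply it, then retire the log entry, so that after a crash only the short journal needs replaying rather than the whole volume being scanned. This is a toy model of the concept, not the NTFS log format:&lt;br /&gt;

```python
class JournaledStore:
    """Toy write-ahead journal: intent is logged before the change is applied."""
    def __init__(self):
        self.disk = {}        # the main store
        self.journal = []     # durable log of intended changes

    def write(self, key, value):
        self.journal.append(("set", key, value))  # 1. log the intent
        self.disk[key] = value                    # 2. apply to the main store
        self.journal.pop()                        # 3. retire the log entry

    def recover(self):
        # After a crash, replay whatever is still pending in the journal.
        for op, key, value in self.journal:
            if op == "set":
                self.disk[key] = value
        self.journal.clear()

s = JournaledStore()
s.write("a.txt", b"data")
# Simulate a crash between steps 1 and 2: intent logged, change not applied.
s.journal.append(("set", "b.txt", b"more"))
s.recover()
```

Because the intent is always logged before the data moves, recovery never finds a half-applied change it cannot complete.&lt;br /&gt;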
&lt;br /&gt;
====ext4====&lt;br /&gt;
The fourth extended file system (ext4) is a Linux file system. Like NTFS, ext4 uses volumes, but it does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a run of contiguous physical blocks.[4] Extents represent the data stored in the volume and give better performance than ext3 when handling large files, for which ext3 had very large overhead. Ext4 is also a journaling file system: it records the changes to be made in a journal before making them, in case of an interruption while writing to the disk. To help ensure data integrity, ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there.[4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
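The advantage of extents over per-block maps is easy to see in miniature: one (logical start, physical start, length) descriptor can stand in for an arbitrarily long contiguous run, where an ext3-style map needs one pointer per block. A hypothetical example with made-up block numbers:&lt;br /&gt;

```python
from collections import namedtuple

# An extent names a whole run of contiguous blocks with one descriptor.
Extent = namedtuple("Extent", "logical physical length")

def blocks_covered(extents):
    """Total number of data blocks described by a list of extents."""
    return sum(e.length for e in extents)

# A 1000-block file stored contiguously needs a single extent,
# where a classic indirect block map would need 1000 pointers.
file_map = [Extent(logical=0, physical=51200, length=1000)]
```

The win shrinks as files fragment: a badly fragmented file degenerates toward one extent per block, so extent-based designs still benefit from allocators that keep files contiguous.&lt;br /&gt;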
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference between ZFS and these current file systems is size.  NTFS allows a maximum volume size of 256TB and ext4 allows 1EB, while ZFS allows a maximum file system size of 16EB, 16 times more than the current ext4 Linux file system.  Given the amounts of storage available to the current file systems, it is clear that ZFS is better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves availability, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree file system, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, its main goals of extensibility, reliability and maintainability are clearly visible in the Btrfs feature set.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure.  In Btrfs, not only are files and directories stored in a B-tree, but so is file system data such as extents, transaction logs, data relocation records, and chunks.  A subvolume is a named B-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
The implementation of Btrfs is quite interesting, as the developers created it with good coding practices.  First of all, Btrfs has very simple underlying components, which allows most developers to easily understand the basics.  Additionally, the system uses the B-tree structure extensively: beyond actual data and metadata storage, the structures that maintain and track space allocation use B-trees as well.&lt;br /&gt;
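Because everything in Btrfs is an item in a B-tree, one sorted-key lookup mechanism serves files, directories and allocation records alike. Items are keyed by an (objectid, item type, offset) triple; the sketch below uses a sorted Python list with binary search as a stand-in for a single B-tree leaf, and the item payloads are illustrative strings, not real on-disk structures:&lt;br /&gt;

```python
import bisect

# Toy B-tree leaf: items sorted by their (objectid, item_type, offset) key.
leaf = sorted([
    ((256, "DIR_ITEM", 123), "directory entry notes.txt"),
    ((257, "INODE_ITEM", 0), "inode for notes.txt"),
    ((257, "EXTENT_DATA", 0), "first extent of notes.txt"),
])
keys = [k for k, _ in leaf]

def lookup(key):
    """Binary-search the leaf for an exact key match."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    return leaf[i][1]
```

Grouping all items for one object under the same objectid keeps an inode and its data extents adjacent in the tree, which is part of why one generic structure performs well for such different kinds of metadata.&lt;br /&gt;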
&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS, but it currently lacks a couple of features that ZFS does have: Btrfs has neither the self-healing capability nor the data deduplication of ZFS, and ZFS supports more software RAID configurations than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5] &lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr. William F. Heybruck (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F., &amp;amp; Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS&#039;07).&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4539</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4539"/>
		<updated>2010-10-15T05:17:48Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is doing what */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Contacts / If interested ==&lt;br /&gt;
Tawfic : tfatah@gmail.com&lt;br /&gt;
&lt;br /&gt;
Andy Zemancik: andy.zemancik@gmail.com&lt;br /&gt;
&lt;br /&gt;
Lester Mundt: lmundt@gmail.com&lt;br /&gt;
&lt;br /&gt;
Matthew Chou : mateh.cc@gmail.com (this is mchou2)&lt;br /&gt;
&lt;br /&gt;
Nisrin Abou-Seido: naseido@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
== Suggested References Format ==&lt;br /&gt;
Author, publisher/university, Name of the article&lt;br /&gt;
&lt;br /&gt;
== Who is doing what ==&lt;br /&gt;
Suggestion: In order to avoid duplication. Please state what section/item you&#039;re currently working on.&lt;br /&gt;
&lt;br /&gt;
Azemanci: Currently working on Section Three Current File Systems.&lt;br /&gt;
&lt;br /&gt;
Nisrin (naseido): Working on intro, conclusion and editing&lt;br /&gt;
&lt;br /&gt;
Lester: Working on section 3 BTRFS and WinFS&lt;br /&gt;
&lt;br /&gt;
KEY ISSUE: We need a thesis statement. Please suggest ideas.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 01:38, 15 October 2010 (UTC) I think the thesis statement is implied in the current intro, i.e. &amp;quot;.. of avoiding some of the major problems associated&lt;br /&gt;
with traditional file systems . . &amp;quot; suggests that it&#039;s unacceptable anymore to tolerate problems in today&#039;s IT environment. If&lt;br /&gt;
you&#039;d like to expand on it, you could mention the continuous need for flexibility vis-a-vis using data. Example: there&#039;s a growing&lt;br /&gt;
trend with cloud computing (the marketable name for distributed computing). Users who will opt for that option will have to trust&lt;br /&gt;
the host companies with their data. It won&#039;t be acceptable to them to be told that a file here or a directory there was lost. The&lt;br /&gt;
issue with the growth of smart phones and yet to proliferate tablets also increases the demands for flexibility ( need to be able&lt;br /&gt;
to add/shrink/manage storage on the fly and avoid manual intervention when problems arise) . . etc. Hope that helps.&lt;br /&gt;
&lt;br /&gt;
The conclusion would have to assert the main points regarding ZFS, i.e. its modularization and administrative simplicity, its&lt;br /&gt;
virtualization of storage via the use of SPA and DMU, and its self-healing abilities. I am currently working on the section&lt;br /&gt;
that talks ( briefly ) about ZFS&#039;s self-healing ( for lack of better words ). So I&#039;ll be online for some time. The info on SPA&lt;br /&gt;
and DMU is already there. If it&#039;s not clear enough, please let me know&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 03:05, 15 October 2010 (UTC) I agree completely with the thesis and mentioned this at the bottom of the page.  &lt;br /&gt;
&lt;br /&gt;
Something along the lines of &amp;quot;ZFS is a file system designed to support the changing requirements of computing&amp;quot; as I had mentioned before &amp;quot;server needs was of particular attention&amp;quot; then continue to expand by describing the environment that is causing these changes &amp;quot;more companies with distributed large scale data storage &amp;quot;the cloud&amp;quot; &amp;quot; as Tawfic has suggested this establishes motivation.&lt;br /&gt;
&lt;br /&gt;
Then we mention the goals that were desired extensibility( accomplished through modularization ), reliability ( checksums, copy on write ) and maintainability ( administrative simplicity)&lt;br /&gt;
&lt;br /&gt;
In the intro to ZFS we talk about how it&#039;s feature set supports the design goals.  Tawfic has then done most of the feature descriptions.&lt;br /&gt;
&lt;br /&gt;
The section on legacy filesystems should have a mini-intro talking about the state of the environment they were designed for and their goals at the time.&lt;br /&gt;
&lt;br /&gt;
Describe then which has been done.&lt;br /&gt;
&lt;br /&gt;
Small contrast and compare with ZFS generally summarizing along the lines of &amp;quot;gee that sure is better&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Repeat for the other two sections, current and future, with each contrast/compare getting larger since they are more comparable, and with different conclusions.&lt;br /&gt;
&lt;br /&gt;
== Deadline ==&lt;br /&gt;
Suggestion: Adding content should stop on Thursday, October 14&#039;th at 3:00 PM. Any work after that&lt;br /&gt;
should go into formatting, spelling, and grammar checking.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)&lt;br /&gt;
- I will definitely be adding content after this time probably late, late into the evening.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we&#039;d have to be done by 11:00Pm.&lt;br /&gt;
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 Am or so. So anytime&lt;br /&gt;
before that is Ok.&lt;br /&gt;
&lt;br /&gt;
== Essay Format Take 2 ==&lt;br /&gt;
Hello. I am suggesting the following format instead. If you agree, I&#039;ll take care of merging the existing info into this new format. My feeling is that this format is&lt;br /&gt;
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Abstract&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: Main point. Current file systems are neither versatile nor intelligent enough to handle the rapidly&lt;br /&gt;
 growing needs of dynamic storage.&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer&lt;br /&gt;
 demand . . etc )&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more&lt;br /&gt;
 intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Limitations&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Zettabyte File System&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Dissected&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: List the seven components of ZFS and basically what makes a ZFS&lt;br /&gt;
 E.g. interface, various parts, and external needed libraries . . etc.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Features Beyond Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Possible Real-Life Scenarios / Examples&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: 2-3 examples where ZFS was/could/is being considered for use.&lt;br /&gt;
&lt;br /&gt;
 TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract&lt;br /&gt;
         (thesis statement).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Alternatives to ZFS&#039;&#039;&#039;&lt;br /&gt;
 one example is good enough.&lt;br /&gt;
 TO-DO: a brief description of the alternative.&lt;br /&gt;
 Main argument for its viability.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Pros/Cons&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: just a list of pluses and minuses&lt;br /&gt;
&lt;br /&gt;
 TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional&lt;br /&gt;
 FS’s are no longer viable, and  stressing once more that ZFS is a valid alternative.&lt;br /&gt;
&lt;br /&gt;
== Essay Format ==&lt;br /&gt;
&lt;br /&gt;
I started working on the main page.  The bullets are to be expanded. Other groups are working in their respective discussion pages but I think it&#039;s all right to put our work in progress on the front page.  Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)&lt;br /&gt;
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt;  what do you think of listing the capacities of the file system under major features?  I was thinking that we could overview the features in brief, then delve into each one individually.&lt;br /&gt;
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what you&#039;re suggesting in one section. So here is the structure I am thinking of.&lt;br /&gt;
&lt;br /&gt;
* Intro &lt;br /&gt;
* Section One ZFS&lt;br /&gt;
** Major feature 1&lt;br /&gt;
** Major feature 2&lt;br /&gt;
** Major feature 3 &lt;br /&gt;
* Section Two Legacy File Systems&lt;br /&gt;
** Legacy File System1( FAT32 ) - what it does&lt;br /&gt;
** Legacy File System2( ext2 ) - what it does&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Three Current File Systems&lt;br /&gt;
** NTFS?&lt;br /&gt;
** ext4?&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Four future file Systems&lt;br /&gt;
** BTRFS&lt;br /&gt;
** WinFS or ??&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Conclusion&lt;br /&gt;
&lt;br /&gt;
What does everyone think of this format?   While everyone should contribute to section one we could divvy up the rest.&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming its own essay just on deduplication.&lt;br /&gt;
&lt;br /&gt;
The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.&lt;br /&gt;
&lt;br /&gt;
The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than a full end-to-end solution.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC)  The section on deduplication looks good; I agree it looks like the right amount of coverage for a portion of a single section.  You&#039;re also right about the old file systems not being able to hold a candle to ZFS, and the conclusion section should talk about how storage needs and computers have changed.  An intro to that section could set the stage for that period as well: non-multi-threaded, single-processor systems with much smaller RAM; even the applications were radically different, and the Internet was just single webpages without the high performance needs of web commerce and online banking, for example.  I have another assignment so won&#039;t be contributing too much until Monday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)&lt;br /&gt;
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)&lt;br /&gt;
I think I prefer the outline I proposed only because it&#039;s a very regimented contrast/compare essay format and should get us any marks for format.  Most proper essays don&#039;t usually have a dedicated pros cons list.  Heading more towards a report format I think.  It&#039;s really what everyone agrees on.  I won&#039;t be touching the essay until tomorrow though.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)&lt;br /&gt;
I like Lmundt&#039;s outline.  How would you like to divide up the work?  Also can everyone post the contact information so we know exactly who is in our group.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)&lt;br /&gt;
No problem, I&#039;ll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include&lt;br /&gt;
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that&lt;br /&gt;
if we keep these points in mind. I started looking into the &amp;quot;dissect subsection&amp;quot; in the format I suggested, which is related to the ZFS features&lt;br /&gt;
section one in the current format. I&#039;ll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I&#039;ll&lt;br /&gt;
take care of section one since I&#039;ve already done some work on it. I suggest that each member of the group picks two items from one of the other&lt;br /&gt;
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract&lt;br /&gt;
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the&lt;br /&gt;
different pieces.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)&lt;br /&gt;
I will begin working on section three current File Systems unless someone else has already begun working on it.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)&lt;br /&gt;
I am going to start researching for section 2.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC)  Alright so all the sections are being taken care of so we should be good to go for Thursday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) &#039;&#039;&#039;No one is assigned to section four&#039;&#039;&#039; ? Also, for those who haven&#039;t picked any section or subsection, please help out with the sections you&#039;re&lt;br /&gt;
more familiar with.&lt;br /&gt;
&lt;br /&gt;
Finally, if you were in class today (well, technically yesterday), then you&#039;ve heard Anil talk about plagiarism. I know this is common knowledge, so forgive&lt;br /&gt;
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is&lt;br /&gt;
simply impossible for the rest of the group to check whether every member&#039;s sentence is genuine or not. So use your own words/phrases ( doesn&#039;t&lt;br /&gt;
have to be fancy or sophisticated ). If you&#039;re not sure, please check with the rest of the group.&lt;br /&gt;
&lt;br /&gt;
Good luck, and good night.&lt;br /&gt;
--Tawfic&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC)  My bad I misread something I thought you were doing current file systems section 3.  I&#039;ll take section 3 but then someone needs to do section 4.  There are 4 of us so this should not be a problem.&lt;br /&gt;
&lt;br /&gt;
--[[User:Naseido|Naseido]] 13 October 2010  Sorry I haven&#039;t contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I&#039;ll be doing section 4.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)&lt;br /&gt;
I have done what I can for the legacy file systems, if someone who doesn&#039;t have any particular job wouldn&#039;t mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC)  I&#039;m not going to have my info posted by 3:00.  Also how and where are we supposed to cite our sources?&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have till 7:00 Am ( or till Anil locks the Wiki down, though I wouldn&#039;t count on more than 7) Friday, Oct 15. For citing, I am using&lt;br /&gt;
this convention. Bla.....Bla [Z1. P3] means I am using info from page 3 of article labeled as Z1 in references section.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 20:48, 14 October 2010 (UTC) Citation reference looks good.  I think in each section we should talk about the motivation behind the design, with a small intro for each section.  For example, in the legacy area (FAT32 and ext), we could talk about computing becoming commonplace for the average worker, so most development was for single users with a relatively small hard drive and not too many files.  That in turn justifies the design decisions for each vintage of OS.&lt;br /&gt;
The main intro should be about the motivations behind the design: intended for servers, with a focus on expandability, reliability and self-maintenance.  This is the motivation behind all those cool features we are detailing.&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
Not from your group. Found a file which goes to the heart of your problem&lt;br /&gt;
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-149902.pdf ZFSDatasheet]&lt;br /&gt;
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:  &lt;br /&gt;
* PDF:  http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf&lt;br /&gt;
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&amp;amp;hl=en&amp;amp;as_sdt=2000&lt;br /&gt;
&lt;br /&gt;
Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey, I&#039;m not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey there. I&#039;m not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations&lt;br /&gt;
&lt;br /&gt;
-- Munther&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS&lt;br /&gt;
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)&lt;br /&gt;
A bit late, but I found a comparison of current file systems including ZFS:&lt;br /&gt;
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf&lt;br /&gt;
&lt;br /&gt;
http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4538</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4538"/>
		<updated>2010-10-15T05:16:48Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* ZFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to handle the functionality required of both a file system and a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, low levels of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2]:&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system,&lt;br /&gt;
i.e. via the division of responsibilities across various modules (in this case, seven). Each of these modules provides a specific piece of functionality; as a consequence,&lt;br /&gt;
the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
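&lt;br /&gt;
As a rough illustration, the SPA&#039;s malloc()/free()-style interface could be imagined as follows (a toy Python sketch with invented names, not actual ZFS code):&lt;br /&gt;

```python
# Toy model of the SPA's allocator-style interface: callers allocate and
# free storage by DVA and never see which physical device backs a block.
class StoragePool:
    def __init__(self, num_blocks):
        self.free_list = list(range(num_blocks))  # available block ids
        self.blocks = {}                          # maps DVA to block data

    def alloc(self, data):
        """Allocate a block, return its data virtual address (DVA)."""
        dva = self.free_list.pop()
        self.blocks[dva] = data
        return dva

    def free(self, dva):
        """Return the block at this DVA to the pool."""
        del self.blocks[dva]
        self.free_list.append(dva)

pool = StoragePool(num_blocks=8)
dva = pool.alloc(b"hello")
assert pool.blocks[dva] == b"hello"
pool.free(dva)
assert len(pool.free_list) == 8
```

Callers hold only DVAs; which physical blocks back them is entirely the pool&#039;s business.&lt;br /&gt;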
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will be a very long time before that technology encounters limitations; even then, the SPA module can be replaced with the remaining modules of ZFS left intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles&lt;br /&gt;
traditional volume manager tasks, such as mirroring, via the use of vdevs: each vdev implements a specific task, so if the SPA needed to handle mirroring, a vdev&lt;br /&gt;
would be written to handle mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
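&lt;br /&gt;
To make the tree idea concrete, here is a hypothetical sketch (invented names, not the real vdev interface) of a mirror vdev fanning writes out to its children:&lt;br /&gt;

```python
# Toy vdev tree: a "mirror" vdev replicates writes to all children and can
# satisfy reads from any healthy copy; a leaf vdev stands in for a device.
class LeafVdev:
    def __init__(self):
        self.data = {}
    def write(self, addr, block):
        self.data[addr] = block
    def read(self, addr):
        return self.data.get(addr)

class MirrorVdev:
    def __init__(self, children):
        self.children = children
    def write(self, addr, block):
        for child in self.children:      # replicate to every child
            child.write(addr, block)
    def read(self, addr):
        for child in self.children:      # any surviving copy will do
            block = child.read(addr)
            if block is not None:
                return block
        return None

disks = [LeafVdev(), LeafVdev()]
mirror = MirrorVdev(disks)
mirror.write(7, b"payload")
del disks[0].data[7]                     # simulate losing one copy
assert mirror.read(7) == b"payload"      # still readable via the mirror
```

New behaviours (striping, RAID-Z-like layouts) would be added as new node types in the same tree, which is the modularity point made above.&lt;br /&gt;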
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object&lt;br /&gt;
in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode, a data structure that stores per-object information about blocks. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS a file system is a collection of object sets, each of which is a collection of objects, which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant layer to manage DMU&lt;br /&gt;
objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing.&lt;br /&gt;
&lt;br /&gt;
The main strategies can be summed up as: checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves; the root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8].&lt;br /&gt;
Each block&#039;s checksum is maintained in its parent&#039;s indirect block. This scheme of keeping the checksum and the data separate reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, ZFS detects the failure via the stored checksum of the corrupted block. It can then retrieve a replica from another&lt;br /&gt;
location and correct (heal) the affected block. [Z1. P7-9].&lt;br /&gt;
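&lt;br /&gt;
A toy model of this checksum-based healing (simplified: real ZFS keeps checksums in parent indirect blocks and replicas in ditto blocks, discussed below) might look like:&lt;br /&gt;

```python
import hashlib

# Sketch of the scheme described above: the parent stores each child's
# checksum, so corruption is detected on read and a replica can be used
# to repair ("heal") the bad copy.
def checksum(block):
    return hashlib.sha256(block).digest()

def read_with_heal(copies, expected_sum):
    """copies: candidate blocks, primary first, then replicas."""
    for i, block in enumerate(copies):
        if checksum(block) == expected_sum:
            if i != 0:
                copies[0] = block      # repair the corrupted primary
            return block
    raise IOError("all copies corrupted")

good = b"important data"
stored = checksum(good)                # kept in the parent block
copies = [b"imp0rtant data", good]     # primary corrupted, replica intact
assert read_with_heal(copies, stored) == good
assert copies[0] == good               # primary healed from the replica
```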
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, a new block is created, the old block&#039;s contents are copied to it, and the modification is applied there. Any pointers and/or indirect blocks are then updated,&lt;br /&gt;
propagating all the way up to the uberblock. [Z1]&lt;br /&gt;
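&lt;br /&gt;
The copy-on-write idea can be sketched abstractly (a toy nested-dict model, not ZFS&#039;s on-disk format): updating a leaf produces new copies along the path to the root, leaving the old tree intact until the root pointer is swapped.&lt;br /&gt;

```python
# Minimal copy-on-write sketch: nodes are never modified in place; a
# changed leaf gets a new copy, and every node on the path up to the
# root ("uberblock") is rewritten to reference the new copies.
def cow_update(tree, path, new_value):
    """tree: nested dicts; path: list of keys. Returns a NEW root."""
    if not path:
        return new_value
    key = path[0]
    new_node = dict(tree)                         # copy this node
    new_node[key] = cow_update(tree[key], path[1:], new_value)
    return new_node                               # parent repoints here

old_root = {"dir": {"file": b"v1"}}
new_root = cow_update(old_root, ["dir", "file"], b"v2")
assert new_root["dir"]["file"] == b"v2"
assert old_root["dir"]["file"] == b"v1"  # old state intact until the swap
```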
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data, in the form of &amp;quot;ditto blocks&amp;quot;, provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version out of a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set.   There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it.   In general, as you consider smaller blocks of data for deduplication, you increase your &amp;quot;fold factor&amp;quot;, that is, the difference between the logical storage provided vs. the physical storage needed.  At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state.  While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way), and a background process analyzes them at a later time to perform the compression.  This method needs higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
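&lt;br /&gt;
A minimal sketch of in-band, block-level deduplication with a hash table (illustrative only; ZFS&#039;s actual dedup table and checksums differ):&lt;br /&gt;

```python
import hashlib

# Toy in-band deduplication: each incoming block is hashed as it arrives;
# a hash table maps digests to physical blocks, so duplicate blocks are
# stored once and files keep only logical references.
class DedupStore:
    def __init__(self):
        self.table = {}     # digest mapped to block (the dedup hash table)
        self.refs = {}      # digest mapped to reference count

    def write(self, block):
        digest = hashlib.sha256(block).digest()
        if digest not in self.table:
            self.table[digest] = block            # first copy: store it
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest                             # logical reference

    def read(self, digest):
        return self.table[digest]

store = DedupStore()
ref1 = store.write(b"same bytes")
ref2 = store.write(b"same bytes")   # duplicate: no new physical block
assert ref1 == ref2
assert len(store.table) == 1 and store.refs[ref1] == 2
```

The &amp;quot;fold factor&amp;quot; mentioned above is visible here: two logical writes consumed one physical block.&lt;br /&gt;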
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage devices such as hard disks and flash memory, and when saving files onto these devices there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today&#039;s. The average worker/user would not have many files stored on their hard drive, and because of the small amounts of data, which might not be accessed as often as expected, these file systems did not worry much about procedures to preserve data integrity (repairing the file system and relocating files).&lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is made up of sectors (usually 512 bytes). Initially, each sector would contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the system must record which sectors contain the data of the requested file. Since each sector is relatively small in comparison to larger files, it would take significant time and memory to document each sector, which file it is associated with, and where it is located. Because of the inconvenience of documenting so many sectors, the FAT file system introduced clusters: defined groupings of sectors, each related to exactly one file. One issue with clusters is that when a file smaller than a cluster is stored, no other file can use the unused sectors in that cluster. In FAT32, the name FAT stands for File Allocation Table, the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked-list data structure which holds each cluster&#039;s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.” #2.3a&lt;br /&gt;
&lt;br /&gt;
The digits beside each FAT variant&#039;s name, as in FAT32, indicate that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are available. Larger clusters waste space when files are drastically smaller than the cluster size. When a file is accessed, the file system must find all of the clusters that make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters are freed for new data; as a result, some files may have their clusters scattered across the storage device, making access slower. FAT32 does not include a defragmentation system, but all of the recent Windows OSes come with a defragmentation tool. Defragmenting reorganizes the fragments of a file (clusters) so that they reside near each other, which reduces the time it takes to access the file. Since reorganization (defragmenting) is not a built-in function of FAT32, finding an empty space when storing a file requires a linear search through all the clusters; this is one of FAT32&#039;s drawbacks: it is slow. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.&lt;br /&gt;
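&lt;br /&gt;
The cluster-chain walk described above can be sketched as follows (a simplified model with invented structures, not real FAT code):&lt;br /&gt;

```python
# Hypothetical FAT walk: the directory entry gives the first cluster;
# each FAT entry gives the next cluster; a sentinel marks end-of-chain.
EOC = 0x0FFFFFFF                      # end-of-chain marker (FAT32 style)

def read_file(fat, clusters, first_cluster):
    """Follow the cluster chain and concatenate the file's data."""
    data = b""
    cluster = first_cluster
    while cluster != EOC:
        data += clusters[cluster]     # data stored in this cluster
        cluster = fat[cluster]        # FAT entry names the next cluster
    return data

# A fragmented file occupying clusters 2, then 5, then 3;
# cluster 4 belongs to some other file.
fat = {2: 5, 5: 3, 3: EOC, 4: EOC}
clusters = {2: b"AB", 5: b"CD", 3: b"EF", 4: b"zz"}
assert read_file(fat, clusters, 2) == b"ABCDEF"
```

The linear chain is also why fragmentation hurts FAT so much: every cluster lookup depends on the previous one.&lt;br /&gt;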
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the memory space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). There is a superblock that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes: structures that contain the description of the file, its type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32, the file allocation table defined how file fragments were organized, and it was vital to have duplicate copies of the FAT in case of crashes. Similarly, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are affected. These backup copies are used when the system had an unclean shutdown and requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies. #2.3d&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed using different file systems, one can note that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 supports 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes (quite a bit larger). “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process” #2.3e; because of this, fsck is not needed in ZFS, whereas it is in the ext2 file system. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system, and nothing is unavailable when accessing data from ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
The New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds all the metadata for all the files in the volume. The file system data area stores all data that is not included in the Master File Table. Finally, the Master File Table copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT is itself a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space, though this can affect performance: to move compressed files, they must first be decompressed, then transferred and recompressed. NTFS does have certain volume and file size constraints. [3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage; it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes, like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a run of contiguous physical blocks. [4] Extents represent the data that is stored in the volume, and allow for better performance than ext3 when handling large files, where ext3 had a very large overhead. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, guarding against interruption while writing to the disk. To help ensure data integrity, ext4 utilizes checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
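&lt;br /&gt;
The advantage of extents over per-block maps can be sketched as follows (hypothetical structures, not ext4&#039;s on-disk format):&lt;br /&gt;

```python
# Toy extent map: one descriptor covers a whole contiguous run of
# physical blocks, where a block map would need one entry per block.
class Extent:
    def __init__(self, logical_start, physical_start, length):
        self.logical_start = logical_start    # file-relative block number
        self.physical_start = physical_start  # on-disk block number
        self.length = length                  # blocks in this run

def lookup(extents, logical_block):
    """Map a file-relative block number to a physical block number."""
    for e in extents:
        offset = logical_block - e.logical_start
        if offset in range(e.length):         # inside this contiguous run
            return e.physical_start + offset
    return None                               # hole or past end of file

# One descriptor maps 1000 contiguous blocks; a block map would
# need 1000 separate entries for the same file.
extents = [Extent(0, 5000, 1000)]
assert lookup(extents, 0) == 5000
assert lookup(extents, 999) == 5999
assert lookup(extents, 1000) is None
```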
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size.  NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB, 16 times more than the current ext4 Linux file system.  Given the amounts of storage available to the current file systems, it is clear that ZFS is better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree file system, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure, where a subvolume is a named B-tree made up of the stored files and directories.&lt;br /&gt;
&lt;br /&gt;
Upon first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS does have.  Btrfs doesn&#039;t have the self-healing capability or data deduplication of ZFS, and ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4537</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4537"/>
		<updated>2010-10-15T05:15:43Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* ZFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2].&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Data Set and snapshot Layer).	&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,&lt;br /&gt;
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,&lt;br /&gt;
the entire system becomes simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it does not abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. This leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system it cannot be shared with other file systems, even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
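&lt;br /&gt;
The malloc()/free() analogy can be sketched as a toy allocator (hypothetical names, not actual ZFS code) in which blocks are handed out and reclaimed purely by DVA, with the physical layout hidden from the caller:&lt;br /&gt;

```python
# Toy sketch of an SPA-like interface (hypothetical names, not real ZFS code):
# blocks are allocated and freed by data virtual address (DVA), and the
# caller never sees which physical device backs a given block.

class ToyStoragePool:
    def __init__(self, num_blocks):
        self.free_dvas = list(range(num_blocks))  # virtual addresses, not disk offsets
        self.allocated = {}

    def alloc(self, size):
        """Behaves like malloc(): hand out a block, identified by its DVA."""
        dva = self.free_dvas.pop()
        self.allocated[dva] = size
        return dva

    def free(self, dva):
        """Behaves like free(): return the block to the pool."""
        del self.allocated[dva]
        self.free_dvas.append(dva)

pool = ToyStoragePool(num_blocks=8)
dva = pool.alloc(size=4096)
pool.free(dva)
```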
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations; even then, the SPA module can be replaced with the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children, where each child can be another virtual device (i.e. a vdev) or a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, via the use of vdevs. Each vdev implements a specific task: if the SPA needed to handle mirroring, for example, a vdev would be written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
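&lt;br /&gt;
Mirroring through a vdev tree can be sketched as a toy model (real vdevs are kernel data structures; these names are invented for illustration): an interior mirror vdev simply forwards each write to all of its children:&lt;br /&gt;

```python
# Toy vdev tree (hypothetical, for illustration): an interior vdev such as a
# mirror forwards each write to all of its children, which may themselves be
# vdevs or leaf device drivers.

class LeafVdev:
    def __init__(self, name):
        self.name = name
        self.writes = []

    def write(self, block):
        self.writes.append(block)

class MirrorVdev:
    def __init__(self, children):
        self.children = children

    def write(self, block):
        # Mirroring at this level is just "write to every child".
        for child in self.children:
            child.write(block)

disks = [LeafVdev("disk0"), LeafVdev("disk1")]
mirror = MirrorVdev(disks)
mirror.write(b"data")
```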
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode. A dnode is a data structure that stores per-object block information; in other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant interface to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
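&lt;br /&gt;
The block/object/object-set layering can be sketched as a toy model (hypothetical names, not the real DMU structures):&lt;br /&gt;

```python
# Toy view of the DMU layering (illustrative): blocks are grouped into
# objects, objects into object sets, and a file system is an object set.

class ToyObject:
    def __init__(self, obj_number, blocks):
        self.obj_number = obj_number   # a 64-bit object number in real ZFS
        self.blocks = blocks           # the dnode tracks these per object

class ObjectSet:
    def __init__(self):
        self.objects = {}

    def add(self, obj):
        self.objects[obj.obj_number] = obj

fs = ObjectSet()                       # a file system is an object set
fs.add(ToyObject(1, blocks=[b"directory data"]))
fs.add(ToyObject(2, blocks=[b"file data"]))
```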
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing; the main strategies it follows can be summed up as checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions, the ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are done atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8]. Each block has a checksum, which is maintained in the block&#039;s parent indirect block. This scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure can be detected from the uberblock, since the tree of checksums gives it access to the checksum of the corrupted block. A backup can then be retrieved from another location to correct (heal) the affected block. [Z1. P7-9].&lt;br /&gt;
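&lt;br /&gt;
A minimal sketch of this parental-checksum scheme (illustrative names only, and SHA-256 standing in for whatever checksum the file system uses): the parent stores the checksum of its child block, detects a mismatch on read, and heals the child from a replica:&lt;br /&gt;

```python
import hashlib

# Toy sketch of ZFS-style parental checksums (illustrative names): each
# parent keeps the checksum of its child block, so corruption in the child
# is detected on read and a replica can be used to heal it.

def checksum(data):
    return hashlib.sha256(data).hexdigest()

class Block:
    def __init__(self, data):
        self.data = data

class Parent:
    def __init__(self, child, replica):
        self.child = child
        self.replica = replica            # a healthy copy stored elsewhere
        self.child_sum = checksum(child.data)

    def read(self):
        if checksum(self.child.data) != self.child_sum:
            # Detected corruption: heal the child from the replica.
            self.child.data = self.replica
        return self.child.data

good = b"payload"
parent = Parent(Block(good), replica=good)
parent.child.data = b"corrupted!"         # simulate silent corruption
healed = parent.read()                    # the read detects and heals
```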
&lt;br /&gt;
Lastly, in ZFS&#039;s arsenal of self-healing, comes copy-on-write.&lt;br /&gt;
&lt;br /&gt;
The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, its contents are copied to a newly allocated block and the changes are applied there; any pointers and/or indirect blocks are then updated, traversing all the way up to the uberblock. [Z1]&lt;br /&gt;
&lt;br /&gt;
The DMU thus ensures data integrity at all times. This is considered self-healing simply because it prevents major problems, such as silent data corruption, that are hard to detect.&lt;br /&gt;
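&lt;br /&gt;
The copy-on-write update described above can be sketched as a toy model (not ZFS internals): the new version is written out of place first, and a single pointer switch makes it live, so the old version stays intact until the switch:&lt;br /&gt;

```python
# Toy copy-on-write update (illustrative only): instead of overwriting a
# block in place, write the new version to a fresh block, then atomically
# repoint the root. The old version remains intact until the switch.

class CowStore:
    def __init__(self, data):
        self.blocks = {0: data}   # block id to contents
        self.root = 0             # "uberblock" pointer
        self.next_id = 1

    def update(self, new_data):
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = new_data  # write out of place first
        self.root = new_id              # single atomic pointer switch

store = CowStore(b"v1")
store.update(b"v2")
```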
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
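&lt;br /&gt;
A sketch of the ditto-block read path (illustrative only, with SHA-256 standing in for the real checksum): the reader tries each copy behind a block pointer until one matches the stored checksum:&lt;br /&gt;

```python
import hashlib

# Toy "ditto block" read path (illustrative): a block pointer holds several
# copies of the same data; on a checksum mismatch, try the next copy.

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def read_with_dittos(copies, expected_sum):
    for data in copies:
        if checksum(data) == expected_sum:
            return data
    raise IOError("all copies corrupt")

good = b"metadata"
copies = [b"garbage", good]               # the first copy is corrupted
recovered = read_with_dittos(copies, checksum(good))
```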
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression based on the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only for data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the difference between the logical storage provided and the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way), and a background process analyzes them at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
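&lt;br /&gt;
The hash-table scheme can be sketched as a toy in-band, block-level deduplicator (illustrative names; real systems must also handle hash collisions and persistence):&lt;br /&gt;

```python
import hashlib

# Toy block-level, in-band deduplication (illustrative): each incoming block
# is hashed; a block whose hash is already in the table is stored only once
# physically, and a reference count tracks the logical copies.

class DedupStore:
    def __init__(self):
        self.blocks = {}    # hash to data (physical storage)
        self.refcount = {}  # hash to number of logical references

    def write(self, data):
        h = hashlib.sha256(data).hexdigest()
        if h not in self.blocks:
            self.blocks[h] = data       # first copy: store physically
        self.refcount[h] = self.refcount.get(h, 0) + 1
        return h                        # logical reference to the block

store = DedupStore()
h1 = store.write(b"same bytes")
h2 = store.write(b"same bytes")         # deduplicated: no new physical block
```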
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and when saving files onto these media there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than those of today. Because the average user did not have many files stored on their hard drive, and because small amounts of data might not be accessed often, these file systems did not worry much about procedures to preserve data integrity (repairing the file system and relocating files).&lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored onto storage devices, the device&#039;s memory is made up of sectors (usually 512 bytes). Initially the plan was for these sectors to contain the data of a file, with larger files stored as multiple sectors. To retrieve a file, the system must record which sectors contain that file&#039;s data. Since each sector is small relative to many files, it would take a significant amount of time and memory to document each sector, the file it is associated with, and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system introduced clusters: defined groupings of sectors, each related to exactly one file. An issue with clusters arises when storing a file smaller than a cluster: the file occupies the cluster, and no other file can use the cluster&#039;s unused sectors. In FAT32, the name FAT stands for File Allocation Table, the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster&#039;s information. &amp;quot;For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F&#039;s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.&amp;quot; #2.3a The digits in the name of a FAT system, as in FAT32, indicate that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are available. Larger clusters cause problems when files are drastically smaller than the cluster size, since much of the cluster is wasted. When a file is accessed, the file system must find all of the clusters that together make up the file, which takes longer if the clusters are not organized. When files are deleted, their clusters are freed for new data; as a result, some files may have their clusters scattered across the storage device, making them slower to access. FAT32 does not include a defragmentation system, but recent Windows operating systems ship with a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, reducing the time it takes to access the file. Since reorganization is not a built-in function of FAT32, finding empty space when storing a file requires a linear search through all the clusters; this slowness is one of FAT32&#039;s drawbacks. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.&lt;br /&gt;
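&lt;br /&gt;
The cluster-chain idea quoted above can be sketched as a short walk over a toy FAT (illustrative; a real FAT is an on-disk array, not a Python dict):&lt;br /&gt;

```python
# Toy FAT cluster-chain walk (illustrative): each FAT entry holds the number
# of the next cluster in the file; an entry of all F digits marks the end.

END_OF_CHAIN = 0x0FFFFFFF   # FAT32 end-of-chain marker (28 usable bits)

def clusters_of_file(fat, first_cluster):
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

# A file occupying clusters 2, 3 and 5; cluster 4 belongs to another file.
fat = {2: 3, 3: 5, 5: END_OF_CHAIN, 4: END_OF_CHAIN}
chain = clusters_of_file(fat, 2)   # follows 2, then 3, then 5
```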
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). There is a superblock: a block containing basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes. An inode is a structure containing the description of a file: file type, access rights, owners, timestamps, size, and pointers to the data blocks holding the file&#039;s data. In FAT32, the file allocation table defined how file fragments were organized, and it was vital to keep duplicate copies of the FAT in case of crashes. Similarly, in ext2 the first block is the superblock, which also contains the list of group descriptors (each block group has a group descriptor mapping out where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system has had an unclean shutdown and requires the use of &amp;quot;fsck&amp;quot; (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed using different file systems, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 allows 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. &amp;quot;ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process&amp;quot;#2.3e; because of this, an fsck is not needed in ZFS, whereas it is in ext2. Not having to check for inconsistencies lets ZFS save time and resources by not systematically scanning a storage device. FAT32 manages a single storage device, as does ext2, whereas ZFS incorporates a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and nothing becomes unavailable when accessing data from ZFS.&lt;br /&gt;
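&lt;br /&gt;
The FAT32 volume limits quoted above follow from simple arithmetic, sketched here with exact powers of two:&lt;br /&gt;

```python
# Capacity arithmetic behind the comparison (illustrative): FAT32 uses
# 28 of its 32 bits for cluster numbers, so the volume limit is the
# cluster count times the cluster size.

KB = 2 ** 10
TB = 2 ** 40

clusters = 2 ** 28                # 28 usable bits of cluster numbers
volume_32k = clusters * 32 * KB   # 8 TB with 32 KB clusters
volume_64k = clusters * 64 * KB   # 16 TB with 64 KB clusters
```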
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. The NTFS file system creates volumes, which are then broken down into clusters much like the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds the metadata for all the files in the volume. The file system data area stores all data not included in the Master File Table. Finally, the Master File Table copy is a copy of the MFT.[1] Having the copy of the MFT ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space, although this can affect performance, because moving compressed files requires decompressing, transferring, and recompressing them. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage; it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
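&lt;br /&gt;
The journaling idea can be sketched generically (a toy model, not the NTFS on-disk format): intents are logged before being applied, and anything left in the log is replayed after a crash:&lt;br /&gt;

```python
# Toy write-ahead journal (illustrative): a change is recorded in the
# journal before being applied, so an interrupted update can be replayed
# on recovery instead of being lost.

class JournaledStore:
    def __init__(self):
        self.data = {}
        self.journal = []

    def write(self, key, value):
        self.journal.append((key, value))  # 1. log the intent first
        self.data[key] = value             # 2. then apply the change
        self.journal.pop()                 # 3. retire the entry once applied

    def recover(self):
        # Replay anything still in the journal after a crash.
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()

store = JournaledStore()
store.write("a", 1)
store.journal.append(("b", 2))   # simulate a crash mid-update
store.recover()                  # the logged change is replayed
```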
&lt;br /&gt;
====ext4====&lt;br /&gt;
Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a run of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance than ext3 when handling large files, for which ext3 had a very large overhead. Ext4 is also a journaling file system: it records changes in a journal before making them, in case of an interruption while writing to disk. To help ensure data integrity, ext4 utilizes checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB, up from ext3&#039;s 16TB maximum.[4] The primary goal of ext4 was to increase the amount of storage possible.&lt;br /&gt;
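&lt;br /&gt;
The advantage of extents over per-block tables can be sketched as a toy mapping (illustrative names, not the ext4 on-disk layout): one descriptor covers a whole run of contiguous blocks:&lt;br /&gt;

```python
# Toy extent mapping (illustrative): one extent descriptor covers a run of
# contiguous physical blocks, so a large file needs far fewer map entries
# than a per-block table such as the FAT.

class Extent:
    def __init__(self, logical_start, physical_start, length):
        self.logical_start = logical_start
        self.physical_start = physical_start
        self.length = length

def physical_block(extents, logical):
    for e in extents:
        offset = logical - e.logical_start
        if offset in range(e.length):   # bounds check without comparisons
            return e.physical_start + offset
    raise KeyError(logical)

# A 1000-block file described by a single extent starting at physical 5000.
extents = [Extent(logical_start=0, physical_start=5000, length=1000)]
block = physical_block(extents, 999)
```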
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB, sixteen times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, ZFS is clearly better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Btrfs, the B-tree File System, was started by Oracle in 2007. It is a file system often compared to ZFS because it has very similar functionality, even though much of the implementation is different.&lt;br /&gt;
&lt;br /&gt;
Btrfs is based on the B-tree structure: a subvolume is a named B-tree containing the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
On first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs does not have the self-healing capability or data deduplication of ZFS, and ZFS also supports more software RAID configurations than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf&lt;br /&gt;
&lt;br /&gt;
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf &amp;quot;Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems&amp;quot;], DHTechnologies&lt;br /&gt;
&lt;br /&gt;
* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html &amp;quot;Btrfs Design&amp;quot;], Oracle&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4429</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4429"/>
		<updated>2010-10-15T03:38:23Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* ZFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to combine the functionality of a file system with that of a volume manager. Among the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2].&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved as in any non-trivial software system: responsibilities are divided across modules (in this case, seven). Because each module provides one specific piece of functionality, the system as a whole is simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough: physical blocks are merely re-labelled as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all details of the underlying storage are abstracted from the caller.&lt;br /&gt;
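The malloc()/free() analogy can be sketched in a few lines of Python. This is a toy model for illustration only (the class and method names are invented, not the real SPA interface): a single free list spans every device in the pool, and callers see only opaque virtual addresses.

```python
# Illustrative sketch only: a toy "storage pool allocator" in the spirit
# of ZFS's SPA, not its real interface. It hands out opaque addresses
# (stand-ins for DVAs) from a pool spanning several devices, hiding
# which device backs each block; callers just allocate and free.
class ToyPool:
    def __init__(self, devices):
        # devices: {name: block_count}; the free list spans all of them,
        # so callers never see device boundaries.
        self.free = [(dev, blk) for dev, count in devices.items()
                     for blk in range(count)]
        self.allocated = {}
        self.next_dva = 0

    def alloc(self):
        """Like malloc(): return an opaque virtual address (toy DVA)."""
        dev_blk = self.free.pop()
        dva = self.next_dva
        self.next_dva += 1
        self.allocated[dva] = dev_blk
        return dva

    def free_block(self, dva):
        """Like free(): return the backing block to the shared pool."""
        self.free.append(self.allocated.pop(dva))

pool = ToyPool({"diskA": 2, "diskB": 2})
dvas = [pool.alloc() for _ in range(3)]   # spans both disks transparently
pool.free_block(dvas[0])                  # freed space is shared pool-wide
```

Note how freed blocks return to the shared pool rather than to any one file system, which is the point of pooled storage.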
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and removed from the pool dynamically. And since the SPA uses 128-bit block addresses, it will be a very long time before those addresses run out; even then, the SPA module could be replaced with the remaining modules of ZFS left intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles traditional volume-manager tasks such as mirroring, and it accomplishes them via vdevs: each vdev implements one specific task, so if the SPA needs to handle mirroring, a vdev is written to do exactly that. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
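The vdev tree just described can be sketched as follows. This is a hypothetical illustration (class names invented, not ZFS source): leaf vdevs wrap devices, and an interior mirror vdev implements its one task by fanning writes out to children that share the same interface.

```python
# Hypothetical sketch of the vdev idea: each vdev is a node that either
# wraps a leaf "device" or composes children. A mirror vdev duplicates
# every write to all of its children.
class LeafVdev:
    def __init__(self):
        self.blocks = {}
    def write(self, addr, data):
        self.blocks[addr] = data
    def read(self, addr):
        return self.blocks[addr]

class MirrorVdev:
    """Interior vdev implementing one specific task: mirroring."""
    def __init__(self, children):
        self.children = children
    def write(self, addr, data):
        for child in self.children:   # same interface as a leaf,
            child.write(addr, data)   # so vdevs nest arbitrarily
    def read(self, addr):
        return self.children[0].read(addr)

a, b = LeafVdev(), LeafVdev()
root = MirrorVdev([a, b])
root.write(7, b"payload")   # lands on both children
```

Because interior and leaf vdevs expose the same read/write interface, a new task (striping, say) would just be another node type in the tree.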
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode, a data structure that stores per-object block information. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. Collections of objects, referred to as object sets, are in turn used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is a collection of blocks. These levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX Layer provides a POSIX-compliant interface for managing DMU objects, which allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
ZPL also plays an important role in achieving data consistency and maintaining data integrity. This brings us to how ZFS achieves self-healing; we will look at its main strategies in turn.&lt;br /&gt;
&lt;br /&gt;
These can be summed up as checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL batches write changes to objects and uses the DMU&#039;s object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are applied atomically.&lt;br /&gt;
&lt;br /&gt;
To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves; the root node, in ZFS&#039; terminology, is called the uberblock [Z1. P8]. Each block&#039;s checksum is maintained in its parent&#039;s indirect block. Keeping the checksum and the data separate reduces the probability of both being corrupted simultaneously.&lt;br /&gt;
&lt;br /&gt;
If a write fails for whatever reason, the failure is detected on a later read, because the checksum held above the corrupted block no longer matches the block&#039;s contents. ZFS can then retrieve a backup from another location and correct (heal) the affected block. [Z1. P7-9].&lt;br /&gt;
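A minimal sketch of this self-healing read path, assuming a simplified model in which a parent holds its children's checksums and a mirrored replica is available (this is not ZFS's actual on-disk layout, and CRC32 merely stands in for ZFS's stronger checksums):

```python
# Sketch: parents store their children's checksums, so a corrupted data
# block is caught on read by comparing against the checksum held one
# level up, and "healed" from a replica.
import zlib

def cksum(data):
    return zlib.crc32(data)

blocks = {0: b"good data", 1: b"good data"}      # block 1 mirrors block 0
parent = {"child_cksums": {0: cksum(blocks[0]), 1: cksum(blocks[1])}}

def read_healing(addr, replica):
    data = blocks[addr]
    if cksum(data) != parent["child_cksums"][addr]:  # corruption detected
        data = blocks[replica]                       # fetch healthy copy
        blocks[addr] = data                          # heal in place
    return data

blocks[0] = b"bit-rotted"      # simulate silent corruption of one copy
healed = read_healing(0, replica=1)
```

The key property is that the checksum lives apart from the data it covers, so silent corruption of the data cannot also silently corrupt its own checksum.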
&lt;br /&gt;
Lastly in ZFS&#039;s arsenal of self-healing comes copy-on-write, which is covered along with the rest of the data-integrity machinery in the next section.&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS keeps a checksum for every block of data written to disk. The checksum is verified whenever data is read, to ensure the data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, recalculating the block&#039;s checksum will produce a mismatch with the stored checksum. It is possible for both the block and the checksum record to be corrupted, but the probability that the corrupted block&#039;s recalculated checksum happens to match the corrupted stored checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each containing duplicate data. By default, duplicate blocks are stored only for file system metadata, but this can be extended to user data blocks as well. When a bad checksum is read, ZFS can follow one of the other pointers in the block pointer in the hope of finding a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
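The copy-on-write update sequence in the paragraph above can be sketched in three steps, with a plain dict standing in for the disk and a single pointer playing the uberblock's role (purely illustrative; all names are invented):

```python
# Minimal sketch of the copy-on-write update model: new structures are
# written to fresh locations while the old tree stays live, and one
# atomic update of the root pointer switches to the new state.
store = {"blk1": "old contents"}
root = {"ptr": "blk1"}                 # live tree hangs off this pointer

def cow_update(new_data):
    store["blk2"] = new_data           # 1. write detached copy elsewhere
    assert store["blk2"] == new_data   # 2. verify before linking it in
    root["ptr"] = "blk2"               # 3. one atomic pointer write;
                                       #    "blk1" is now disconnected

cow_update("new contents")
```

If a crash happens before step 3, the root still points at the old, fully consistent tree, which is why no journal replay or fsck pass is needed afterwards.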
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots: essentially, a clone of the entire file system at a certain point in time. In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression based on the idea of physically storing any one block of unique data only once, and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if the data lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file units (blocks), or patch sets. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio of logical storage provided to physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time for deduplication and for reconstruction.&lt;br /&gt;
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means the file is analyzed as it arrives at the storage server and written to disk already in its compressed state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes and compresses them later. This method incurs higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
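The in-band scheme above can be sketched as a toy block-level deduplicator. This is illustrative only: SHA-256 stands in for the content hash, a dict for the hash table, and the tiny block size and file names are invented for the example.

```python
# Toy in-band deduplication over fixed-size blocks: each incoming block
# is hashed as it "arrives" and stored physically only once; files keep
# lists of block hashes. The fold factor is logical bytes / physical.
import hashlib

BLOCK = 4
physical = {}   # hash -> block bytes (each unique block stored once)
files = {}      # filename -> ordered list of block hashes

def ingest(name, data):
    hashes = []
    for i in range(0, len(data), BLOCK):
        blk = data[i:i + BLOCK]
        h = hashlib.sha256(blk).hexdigest()
        physical.setdefault(h, blk)   # duplicate blocks are not re-stored
        hashes.append(h)
    files[name] = hashes

ingest("a.txt", b"AAAABBBBAAAA")      # block "AAAA" appears twice
ingest("b.txt", b"AAAACCCC")          # ...and again in another file

logical = 12 + 8                                   # bytes the clients see
stored = sum(len(b) for b in physical.values())    # AAAA, BBBB, CCCC = 12
fold_factor = logical / stored
```

Reconstruction simply walks a file's hash list and concatenates the shared blocks, which is the transparent read path the client never notices.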
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage media such as hard disks and flash memory, and when saving files onto these media there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 and ext2 are two examples. These file systems were designed for users who had fewer and smaller storage devices than today&#039;s. The average user did not have many files stored on their hard drive, and because those small amounts of data might not be accessed often, the file systems did not worry much about procedures for repairing data integrity (repairing the file system and relocating files).&lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is divided into sectors (usually 512 bytes). Initially the plan was for each sector to hold file data directly, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors hold that file&#039;s data; since each sector is tiny relative to large files, documenting every sector individually would cost significant time and memory. To avoid this overhead, the FAT file system introduced clusters: fixed groupings of sectors, with each cluster belonging to at most one file. One drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster are wasted, since no other file can use them. The name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure holding each cluster&#039;s information in a node. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.” #2.3a The digits in each FAT variant&#039;s name indicate the width of the table entries; in FAT32, the file allocation table is an array of 32-bit values. #2.3b Of those 32 bits, 28 are used to number clusters, so 2^28 clusters are addressable. Large clusters cause problems when files are drastically smaller than the cluster size, because much of each cluster is then wasted. When a file is accessed, the file system must find all of the clusters that make up the file, which takes longer if the clusters are not organized. When files are deleted, their clusters are freed for new data, so over time a file&#039;s clusters can end up scattered across the device, making access slower still. FAT32 does not include a defragmentation system, but all recent Windows versions ship a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which reduces file access time. Because reorganization is not built into FAT32, finding empty space when storing a file requires a linear search through the clusters; this slowness is one of FAT32&#039;s drawbacks. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always holds two copies of the file allocation table so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
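The quoted cluster chain, where the first cluster points to the second and so on until an all-F's end marker, is a linked-list walk. A minimal sketch using 28-bit FAT32 entry values (the table contents here are invented):

```python
# Sketch of following a FAT32 cluster chain: the table maps each cluster
# number to the next cluster of the same file, with 0x0FFFFFFF (all F's
# in the 28 usable bits) marking the file's final cluster.
END = 0x0FFFFFFF
fat = {2: 5, 5: 3, 3: END}    # a file occupying clusters 2 -> 5 -> 3

def cluster_chain(first_cluster):
    chain, cur = [], first_cluster
    while cur != END:
        chain.append(cur)
        cur = fat[cur]        # follow the table entry to the next cluster
    return chain
```

Note the chain here is non-sequential (2, 5, 3), which is exactly the fragmentation the paragraph describes: each hop may land anywhere on the device.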
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System), mimicking some of its functionality while dropping unnecessary parts. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group; it also records the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes: structures containing the file&#039;s description, type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. Just as FAT32 keeps a duplicate copy of its file allocation table in the first cluster in case of crashes, the first block in ext2 is the superblock, which also holds the list of group descriptors (each block group has a group descriptor mapping where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system has had an unclean shutdown and requires “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies. #2.3d&lt;br /&gt;
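Because ext2 distributes inodes evenly across its block groups, locating an inode is simple arithmetic driven by the superblock's inodes-per-group value. A small sketch of that calculation (the function name is invented; inode numbering starts at 1, as in ext2):

```python
# Sketch of ext2's fixed-layout inode lookup: the block group and the
# index within that group's inode table follow directly from the
# superblock's inodes-per-group figure.
def locate_inode(inode_no, inodes_per_group):
    group = (inode_no - 1) // inodes_per_group   # which block group
    index = (inode_no - 1) % inodes_per_group    # slot inside its table
    return group, index

# e.g. with 1000 inodes per group, inode 2500 lives in group 2, slot 499
```

This fixed arithmetic layout is also what makes fsck's full traversal feasible, and what the backup group descriptors protect.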
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed under different file systems, one notices that FAT32 has a maximum volume size of 2 TB (8 TB with 32 KB clusters, 16 TB with 64 KB clusters), ext2 reaches 32 TB, and ZFS can address 2^58 ZB, where each ZB (zettabyte) is 2^70 bytes: vastly larger. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process” #2.3e; because of this, fsck is not needed in ZFS, whereas it is in ext2. Not having to check for inconsistencies lets ZFS save the time and resources of systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS includes volume-manager functionality that can control multiple storage devices, virtual or physical. Managing multiple devices under one file system means resources are available throughout the system, and nothing becomes unavailable when accessing data through ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
New Technology File System, also known as NTFS, was first introduced with Windows NT and is used on all modern Microsoft operating systems. NTFS creates volumes, which are broken down into clusters much like in FAT32. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds the metadata for all files in the volume; the file system data area stores data not included in the MFT; and the MFT copy is a copy of the Master File Table.[1] Keeping this copy ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT tracks all file attributes in a relational database, of which the MFT itself is also a part, and every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before making them, in case there is an interruption while the changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance because the entire volume does not have to be scanned to find changes. [2] NTFS also allows files to be compressed to save disk space, though this can hurt performance: to move compressed files, they must first be decompressed, then transferred and recompressed. [3] NTFS does have certain volume and size constraints: it is a 64-bit file system, allowing 2^64 bytes of addressable storage, but it is capped at a maximum file size of 16 TB and a maximum volume size of 256 TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS, but does not use clusters, and it was designed to allow for greater scalability than ext3. Ext4 uses extents: descriptors each representing a run of contiguous physical blocks. [4] Extents represent the data stored in the volume and allow for better performance than ext3 when handling large files. Ext4 is also a journaling file system: it records changes in a journal before making them, guarding against interruptions while writing to disk. To help ensure data integrity, ext4 uses checksumming; a checksum has been implemented for the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing data when moving it. Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which raises the maximum volume size to 1 EB from ext3&#039;s 16 TB maximum.[4]&lt;br /&gt;
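The advantage of extents over per-block pointers shows up clearly in a small sketch: one (start, length) descriptor replaces thousands of individual block numbers for a contiguous file. This is illustrative only, not ext4's on-disk extent-tree format.

```python
# Illustration of why extents help with large files: a contiguous run of
# blocks is described by one (start, length) descriptor, where an
# indirect block map would need one pointer per block.
def block_map(start, length):
    return list(range(start, start + length))   # one entry per block

def extent(start, length):
    return (start, length)                      # one descriptor total

blocks = block_map(1000, 4096)   # 4096 pointers for a 4096-block file
desc = extent(1000, 4096)        # a single extent covers the same run
```

For a fragmented file an extent tree degrades toward one descriptor per fragment, which is why extents pay off most for the large, mostly contiguous files mentioned above.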
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference between ZFS and other current file systems is size: NTFS allows a maximum volume of 256 TB and ext4 allows 1 EB, while ZFS allows a maximum file system of 16 EB, 16 times more than the current ext4 Linux file system. Given the amount of storage available, ZFS is clearly well suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves availability, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
BTRFS, the B-tree File System, is a file system often compared to ZFS because it offers very similar functionality, even though much of the implementation differs. BTRFS is based on the B-tree structure: a subvolume is a named B-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F, &amp;amp; Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4357</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4357"/>
		<updated>2010-10-15T02:43:37Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* ZFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to combine the functionality of a file system with that of a volume manager. Among the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2].&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Dataset and Snapshot Layer).&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved as in any non-trivial software system: responsibilities are divided across modules (in this case, seven). Because each module provides one specific piece of functionality, the system as a whole is simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn&#039;t abstract the underlying physical storage enough: physical blocks are merely re-labelled as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all details of the underlying storage are abstracted from the caller.&lt;br /&gt;
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and removed from the pool dynamically. And since the SPA uses 128-bit block addresses, it will be a very long time before those addresses run out; even then, the SPA module could be replaced with the remaining modules of ZFS left intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, via the use of vdevs: each vdev implements a specific task, so if the SPA needed to handle mirroring, a vdev would be written for mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode, a data structure that stores per-object block information. In other words, a dnode provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. Collections of objects, referred to as object sets, are in turn used to describe the file system. In essence then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX layer (ZPL) provides a POSIX-compliant interface for managing DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
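The blocks-to-objects-to-object-sets layering can be sketched as follows. Dnode and ObjectSet here are simplified, invented stand-ins for the on-disk structures; real dnodes also carry block sizes, indirection levels, and checksums.&lt;br /&gt;

```python
class Dnode:
    """Per-object bookkeeping: which blocks make up one object."""
    def __init__(self, object_id):
        self.object_id = object_id   # a 64-bit object number in real ZFS
        self.block_ptrs = []         # addresses of the blocks holding data

class ObjectSet:
    """A group of objects; a file system is described by object sets."""
    def __init__(self):
        self.objects = {}

    def create_object(self, object_id):
        self.objects[object_id] = Dnode(object_id)
        return self.objects[object_id]

fs = ObjectSet()
f = fs.create_object(42)           # a file, seen as an object
f.block_ptrs.extend([100, 101])    # the blocks that hold its data
```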
&lt;br /&gt;
The ZPL also plays an important role in achieving data consistency and maintaining data integrity, which brings us to how ZFS achieves self-healing. We will tackle the main strategies ZFS follows.&lt;br /&gt;
&lt;br /&gt;
These can be summed up as checksumming, copy-on-write, and the use of transactions.&lt;br /&gt;
&lt;br /&gt;
Starting with transactions: the ZPL groups data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is thus assured, since updates are applied atomically.&lt;br /&gt;
&lt;br /&gt;
TO-DO&lt;br /&gt;
copy-on-write&lt;br /&gt;
checksumming&lt;br /&gt;
&lt;br /&gt;
TO-DO : Not finished yet --Tawfic&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
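The checksum-on-write, verify-on-read cycle can be sketched as below. This is a toy version: in real ZFS the checksum lives in the parent block pointer rather than next to the data, and the algorithm (fletcher or SHA-256) is configurable; storing a SHA-256 digest beside each block is enough to show the idea.&lt;br /&gt;

```python
import hashlib

def write_block(disk, addr, data):
    # Store the data together with a checksum computed at write time.
    disk[addr] = (data, hashlib.sha256(data).digest())

def read_block(disk, addr):
    # Recompute the checksum on every read; any mismatch means corruption.
    data, stored_sum = disk[addr]
    if hashlib.sha256(data).digest() != stored_sum:
        raise IOError("checksum mismatch at block %d" % addr)
    return data

disk = {}
write_block(disk, 0, b"hello")
assert read_block(disk, 0) == b"hello"
# Simulate silent corruption of the data while the old checksum survives;
# the next read_block(disk, 0) would now raise instead of returning bad data.
disk[0] = (b"hellp", disk[0][1])
```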
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
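The write-detached-then-swap sequence above can be sketched minimally. The name &amp;quot;uberblock&amp;quot; is borrowed from ZFS terminology; everything else here (the dict-as-disk, cow_commit) is invented for illustration.&lt;br /&gt;

```python
def cow_commit(disk, new_tree):
    # 1. Write the new structures while the old tree is still reachable.
    new_addr = max((a for a in disk if a != "uberblock"), default=-1) + 1
    disk[new_addr] = new_tree       # detached: a crash here loses nothing
    # 2. One atomic update of the root pointer connects the new structures
    #    and disconnects the ones they replace.
    disk["uberblock"] = new_addr

disk = {"uberblock": None}
cow_commit(disk, {"root": "v1"})
cow_commit(disk, {"root": "v2"})  # v1 blocks remain until reclaimed
```

Because the only in-place write is the single root-pointer update, a crash at any point leaves the disk pointing at either the complete old tree or the complete new one, which is why no journal replay is needed.&lt;br /&gt;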
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots.  Essentially, a clone of the entire file system at a certain point in time is created.  In the event of accidental file deletion, a user can access an older version out of a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file units (blocks), or patch sets.   There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it.   In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio between the logical storage provided and the physical storage needed.  At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.&lt;br /&gt;
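Block-level deduplication with a hash table can be sketched as below. DedupStore and its block size are illustrative choices, not ZFS&#039;s actual on-disk dedup table (the DDT), which also tracks reference counts per block.&lt;br /&gt;

```python
import hashlib

class DedupStore:
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}    # digest -> data: each unique block stored once
        self.files = {}     # file name -> list of digests (logical layout)

    def store(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            chunk = data[i:i + self.block_size]
            digest = hashlib.sha256(chunk).digest()
            self.blocks.setdefault(digest, chunk)  # physical write only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reconstruction: reassemble the file from its (possibly shared) blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

store = DedupStore(block_size=4)
store.store("a.txt", b"AAAABBBB")
store.store("b.txt", b"AAAACCCC")   # the AAAA block is shared, not stored twice
```

Shrinking block_size raises the chance of finding shared blocks (a higher fold factor) but grows the digest table, which is exactly the trade-off described above.&lt;br /&gt;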
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already compressed state.  While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested.   In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them later to perform the compression.  This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files exist on storage devices such as hard disks and flash memory, and when saving files onto these devices there must be an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 and ext2 are two examples. These file systems were designed for users who had fewer and smaller storage devices than those of today. The average user would not have many files stored on their hard drive, and because the amounts of data were small and might not be accessed often, these file systems did not worry much about procedures to repair data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is made up of sectors (usually 512 bytes). Initially the plan was for these sectors to contain the data of a file, with larger files stored as multiple sectors. To retrieve a file, each sector must have been stored, and it must have been documented which sectors contained the data of the requested file. Since each sector is small relative to larger files, it would take a significant amount of time and memory to document, for every sector, which file it is associated with and where it is located. Because of the inconvenience of documenting so many sectors, the FAT file system introduced clusters: defined groupings of sectors, each of which is related to exactly one file. One discovered issue with clusters is that when a stored file is smaller than a cluster, the file still occupies the whole cluster, and no other file can use the unused sectors in it. The name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure holding each cluster’s information in a node. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits in the name of a FAT variant, as in FAT32, indicate that the file allocation table is an array of 32-bit values.#2.3b Of those 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. An issue arising from larger clusters is that when files are drastically smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all the clusters that together make up the file, a process that takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data; because of this, some files may end up with their clusters scattered across the storage device, making access slower. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmenting reorganizes the fragments of a file (clusters) so that they reside near each other, which improves file access time. Since reorganization is not a built-in function of FAT32, finding empty space for a new file requires a linear search through all the clusters; this slowness is one of the drawbacks of FAT32. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.&lt;br /&gt;
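The linked-list traversal quoted above can be sketched directly: the table entry for each cluster holds the number of the next one, and an all-F&#039;s entry marks the end of the chain. Modeling the FAT as an in-memory list is, of course, an illustrative simplification.&lt;br /&gt;

```python
EOC = 0x0FFFFFFF   # end-of-chain marker (28 significant bits in FAT32)

def cluster_chain(fat, first_cluster):
    """Follow the chain from the directory entry's first cluster number."""
    chain = []
    cluster = first_cluster
    while cluster != EOC:
        chain.append(cluster)
        cluster = fat[cluster]   # each FAT entry names the next cluster
    return chain

# A file occupying clusters 2 -> 5 -> 6:
fat = [0, 0, 5, 0, 0, 6, EOC]
```

This also makes the linear-search drawback concrete: finding free space means scanning the table entry by entry, since nothing indexes the free clusters.&lt;br /&gt;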
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the memory space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). There is a superblock, a block that contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes. An inode is a structure containing the description of the file: file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s data. In FAT32 the file allocation table defined how file fragments were organized, and it was vital to keep a duplicate copy of the FAT in case of crashes. Similarly, the first block in ext2 is the superblock, which also contains the list of group descriptors (each block group has a group descriptor mapping out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are affected. These backup copies are used when the system has had an unclean shutdown and requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
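The inode structure described above can be sketched as follows. This is a simplified, invented stand-in: real ext2 inodes hold twelve direct pointers plus single, double, and triple indirect pointers, all omitted here.&lt;br /&gt;

```python
class Inode:
    """Toy ext2-style inode: file metadata plus direct block pointers."""
    def __init__(self, mode, size, block_ptrs):
        self.mode = mode              # file type and access rights
        self.size = size              # file length in bytes
        self.block_ptrs = block_ptrs  # data blocks, in order

def read_file(blocks, inode, block_size=1024):
    """Concatenate the inode's data blocks and trim to the file size."""
    data = b"".join(blocks[p] for p in inode.block_ptrs)
    return data[:inode.size]

# A 1500-byte file spread over two 1 KB blocks:
blocks = {7: b"A" * 1024, 9: b"B" * 1024}
ino = Inode(mode=0o100644, size=1500, block_ptrs=[7, 9])
```

Unlike a FAT chain, the inode names all of its blocks up front, so reading block N of a file needs no traversal of earlier blocks.&lt;br /&gt;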
==== Comparison ====&lt;br /&gt;
When observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 allows 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not used in ZFS, whereas it is in ext2. Not having to check for inconsistencies lets ZFS save time and resources by not systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS incorporates volume management that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and nothing becomes unavailable when accessing data from ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. The NTFS file system creates volumes, which are then broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata for all the files in the volume. The file system data area stores all data not included in the Master File Table. Finally, the Master File Table copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT is itself a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it utilizes a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes.[2] NTFS also allows compression of files to save disk space, though this can hurt performance, because compressed files being moved must first be decompressed, then transferred, and then recompressed. NTFS does have certain volume and file size constraints.[3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage, and it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
Fourth Extended File System, also known as ext4, is a Linux file system.   Ext4 also uses volumes like NTFS, but does not use clusters.  It was designed to allow for greater scalability than ext3.  Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks.[4] Extents represent the data stored in the volume and allow better performance than ext3 when handling large files.  Ext4 is also a journaling file system: it records changes to be made in a journal, then makes the changes, in case there is a chance of interruption while writing to the disk.  To help ensure data integrity, ext4 utilizes checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there.[4]  Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data.   Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4]&lt;br /&gt;
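The extent idea can be sketched as below: one descriptor covers a whole run of contiguous physical blocks, instead of one pointer per block. The tuple layout (logical_start, physical_start, length) is an illustrative simplification of ext4&#039;s actual extent records.&lt;br /&gt;

```python
def logical_to_physical(extents, lblock):
    """Map a logical block number to a physical one via extent descriptors."""
    for logical_start, physical_start, length in extents:
        if logical_start <= lblock < logical_start + length:
            # The whole run is contiguous, so the mapping is a single offset.
            return physical_start + (lblock - logical_start)
    raise KeyError("logical block %d not mapped" % lblock)

# A file whose first 100 blocks sit contiguously at physical block 5000,
# followed by 50 more at block 9000 -- only two descriptors needed:
extents = [(0, 5000, 100), (100, 9000, 50)]
```

For a large, mostly contiguous file, this is why extents beat per-block pointers: 150 blocks are described by two small records instead of 150 entries.&lt;br /&gt;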
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference when comparing ZFS to other current file systems is size.  NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB, while ZFS allows for a maximum file system of 16EB, 16 times more than the current ext4 Linux file system.  Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves performance, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
--posted by [Naseido] -- just starting a rough draft for an intro to B-trees&lt;br /&gt;
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)&lt;br /&gt;
&lt;br /&gt;
BTRFS, the B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. BTRFS is based on the b-tree structure, where a subvolume is a named b-tree made up of the files and directories stored.&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008)  [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F., &amp;amp; Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4318</id>
		<title>COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_9&amp;diff=4318"/>
		<updated>2010-10-15T01:45:18Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* ZFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Introduction&#039;&#039;&#039; ==&lt;br /&gt;
ZFS was developed by Sun Microsystems to handle the functionality required of both a file system and a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: possible data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;ZFS&#039;&#039;&#039; ==&lt;br /&gt;
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization&lt;br /&gt;
of storage, and the ability to self-repair. A brief look at ZFS&#039; various components will help illustrate those differences.&lt;br /&gt;
&lt;br /&gt;
The following subsystems make up ZFS [Z3. P2].&lt;br /&gt;
 # SPA (Storage Pool Allocator).&lt;br /&gt;
 # DSL (Data Set and snapshot Layer).	&lt;br /&gt;
 # DMU (Data Management Unit).&lt;br /&gt;
 # ZAP (ZFS Attributes Processor).&lt;br /&gt;
 # ZPL (ZFS POSIX Layer).&lt;br /&gt;
 # ZIL (ZFS Intent Log).&lt;br /&gt;
 # ZVOL (ZFS Volume).&lt;br /&gt;
&lt;br /&gt;
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system, i.e. via the division of responsibilities across various modules (in this case, seven modules). Each of these modules provides a specific piece of functionality; as a consequence, the entire system is simpler and easier to maintain.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it does not abstract the underlying physical storage enough: physical blocks are merely presented as logical ones, and a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is unused.&lt;br /&gt;
&lt;br /&gt;
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks&#039; DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.&lt;br /&gt;
&lt;br /&gt;
ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to or removed from the pool dynamically. And since the SPA uses 128-bit block addresses, it will be a very long time before that design encounters size limitations; even then, the SPA module can be replaced while leaving the remaining modules of ZFS intact. To facilitate the SPA&#039;s work, ZFS enlists the help of the DMU and implements the idea of virtual devices.&lt;br /&gt;
&lt;br /&gt;
Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, via the use of vdevs: each vdev implements a specific task, so if the SPA needed to handle mirroring, a vdev would be written for mirroring. Adding new functionality is thus straightforward, given the clear separation of modules and the use of interfaces.&lt;br /&gt;
 &lt;br /&gt;
The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].&lt;br /&gt;
 &lt;br /&gt;
ZFS uses the idea of a dnode, a data structure that stores per-object block information. In other words, a dnode provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. Collections of objects, referred to as object sets, are in turn used to describe the file system. In essence then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS&#039; flexibility and simplify its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX layer (ZPL) provides a POSIX-compliant interface for managing DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.&lt;br /&gt;
&lt;br /&gt;
TO-DO:&lt;br /&gt;
How ZFS maintains Data integrity and accomplishes self healing ?&lt;br /&gt;
&lt;br /&gt;
copy-on-write&lt;br /&gt;
checksumming&lt;br /&gt;
Use of Transactions&lt;br /&gt;
&lt;br /&gt;
TO-DO : Not finished yet --Tawfic&lt;br /&gt;
&lt;br /&gt;
====Data Integrity====&lt;br /&gt;
&lt;br /&gt;
At the lowest level, ZFS uses checksums for every block of data that is written to disk.  The checksum is checked whenever data is read to ensure that data has not been corrupted in some way.  The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums.  It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block&#039;s checksum matches the corrupted checksum is exceptionally low.&lt;br /&gt;
&lt;br /&gt;
In the event that a bad checksum is found, replication of data in the form of &amp;quot;ditto blocks&amp;quot; provides an opportunity for recovery.  A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data.  By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well.  When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.&lt;br /&gt;
&lt;br /&gt;
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools.  Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure.  When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports &amp;quot;hot spares&amp;quot;, idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.&lt;br /&gt;
&lt;br /&gt;
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model.  New disk structures are written out in a detached state.  Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.&lt;br /&gt;
&lt;br /&gt;
At the user level, ZFS supports file-system snapshots: essentially, a clone of the entire file system at a certain point in time.  In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.&lt;br /&gt;
&lt;br /&gt;
====Data Deduplication====&lt;br /&gt;
&lt;br /&gt;
Data deduplication is a method of inter-file storage compression, based on storing any given block of unique data only once physically and logically linking that block into each file that contains it.  Effective use of data deduplication can reduce the space and power requirements of physical storage, but only for data that lends itself to deduplication.&lt;br /&gt;
&lt;br /&gt;
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or patch sets.   There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it.   In general, considering smaller blocks of data for deduplication increases the &amp;quot;fold factor&amp;quot;, that is, the ratio of the logical storage provided to the physical storage needed.  At the same time, however, smaller blocks mean more hash table overhead and more CPU time for deduplication and reconstruction.&lt;br /&gt;
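The block-granularity scheme described above can be sketched with a hash table keyed by block content. This is a toy illustration (fixed 4-byte blocks, SHA-256 keys), not the implementation of any particular product:

```python
import hashlib

def dedup_store(data, block_size=4):
    # Map each unique block (keyed by a hash of its content) to one
    # physical copy, and represent the file as a list of references
    # into that table.
    physical = {}
    refs = []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        key = hashlib.sha256(chunk).hexdigest()
        physical.setdefault(key, chunk)   # store each unique block once
        refs.append(key)
    return physical, refs

data = b"AAAABBBBAAAACCCC"
physical, refs = dedup_store(data)
# Four logical blocks, but only three unique blocks stored physically.
fold_factor = len(refs) / len(physical)
```

Reconstruction simply follows the references back through the table, which is why smaller blocks cost more CPU time on both the write and the read path.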
&lt;br /&gt;
The actual analysis and deduplication of incoming files can occur in-band or out-of-band.  With in-band deduplication, a file is analyzed as it arrives at the storage server and written to disk already in its compressed state.  While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons.  With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them later to perform the compression.  This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.&lt;br /&gt;
&lt;br /&gt;
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client.  ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Legacy File Systems&#039;&#039;&#039; ==&lt;br /&gt;
Files live on storage devices such as hard disks and flash memory, and saving files to these devices requires an abstraction that organizes how the files are stored and later retrieved. That abstraction is the file system; FAT32 and ext2 are two examples. These file systems were designed for users who had fewer and smaller storage devices than today. The average worker or user did not have many files on their hard drive, and because those relatively small amounts of data might not be accessed often, these file systems did not put much effort into procedures for maintaining data integrity (repairing the file system and relocating files). &lt;br /&gt;
====FAT32====&lt;br /&gt;
When files are stored on a storage device, the device&#039;s memory is divided into sectors (usually 512 bytes). Originally, each sector would hold the data of a file, with larger files spanning multiple sectors; to retrieve a file, the system had to record which sectors contained that file&#039;s data. Since sectors are small relative to many files, documenting every sector and its location would cost significant time and memory. To avoid this bookkeeping, the FAT file system introduced clusters: defined groupings of sectors, each cluster belonging to a single file. One drawback of clusters is that a file smaller than a cluster still occupies the whole cluster, and no other file can use the unused sectors within it. In FAT32, the name FAT stands for File Allocation Table, the table whose entries describe the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster’s information: “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits in a FAT variant&#039;s name, as in FAT32, indicate that each file allocation table entry is a 32-bit value.#2.3b Of those 32 bits, 28 are used to number the clusters, so 2^28 clusters can be addressed. Larger clusters aggravate the wasted space when files are drastically smaller than the cluster size. When a file is accessed, the file system must locate all the clusters that make up the file, which takes long if the clusters are not organized; deleting files frees clusters for new data, so over time a file&#039;s clusters may end up scattered across the device, making access slower still. FAT32 does not include a defragmentation system, but all recent Windows operating systems ship with a defragmentation tool; defragmenting rearranges the fragments of a file (its clusters) so that they reside near each other, improving file access times. Because reorganization is not built into FAT32, finding empty space for a new file requires a linear search through all the clusters, one of the drawbacks that makes FAT32 slow. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and always holds two copies of the file allocation table, so that if the file system is interrupted a secondary FAT is available to recover the files.&lt;br /&gt;
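The cluster chain described in the quoted passage can be modeled as a simple linked list. The sketch below is illustrative only: a Python dictionary stands in for the table, with the conventional all-ones end-of-chain marker rather than the real on-disk FAT32 layout:

```python
END_OF_CHAIN = 0x0FFFFFFF  # end marker: the 28 used bits all set

def read_file_clusters(fat, first_cluster):
    # Follow the chain: each FAT entry holds the number of the next
    # cluster belonging to the same file, until the end marker.
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

# A file whose data occupies clusters 2, then 5, then 9:
fat = {2: 5, 5: 9, 9: END_OF_CHAIN}
```

Appending to a file means finding a free cluster (in FAT32, by linear search) and rewriting the current last entry to point at it.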
==== Ext2 ====&lt;br /&gt;
The ext2 file system (second extended file system) was modeled on UFS (the Unix File System), mimicking some of its functionality while dropping unnecessary parts. Ext2 divides the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock holds basic information: the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group, as well as the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes, structures containing a description of the file: its type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file&#039;s contents. In FAT32 the file allocation table defined the organization of file fragments, and it was vital to keep duplicate copies of it in case of crashes; similarly, the first block in ext2 is the superblock, which also contains the list of group descriptors (each block group has a group descriptor mapping where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system had an unclean shutdown and the “fsck” (file system checker) must traverse the inodes and directories to repair any inconsistencies.#2.3d&lt;br /&gt;
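The inode structure described above can be sketched as follows; the field set is simplified for illustration (real ext2 inodes also carry link counts, group ids, and a fixed layout of twelve direct block pointers plus indirect, double-indirect and triple-indirect blocks):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Inode:
    # Simplified ext2-style inode: file metadata plus the list of
    # data-block numbers that hold the contents.
    mode: int          # file type and access rights
    uid: int           # owner
    size: int          # file length in bytes
    mtime: int         # modification timestamp
    blocks: List[int]  # pointers to the data blocks

def read_inode_data(disk_blocks, inode):
    # Concatenate the referenced data blocks and trim to the recorded
    # size, since the last block is usually only partially used.
    data = b"".join(disk_blocks[b] for b in inode.blocks)
    return data[:inode.size]

# A 6-byte file stored in two 4-byte blocks:
disk_blocks = {17: b"hell", 42: b"o!.."}
inode = Inode(mode=0o100644, uid=1000, size=6, mtime=0, blocks=[17, 42])
```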
==== Comparison ====&lt;br /&gt;
Comparing how storage devices are managed under different file systems, FAT32 supports a maximum volume of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 supports 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. “ZFS provides the ability to &#039;scrub&#039; all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, ZFS does not need fsck, whereas ext2 does. Not having to check for inconsistencies saves ZFS the time and resources of systematically scanning a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS includes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system and that data remains accessible through ZFS.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Current File Systems&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
====NTFS====&lt;br /&gt;
The New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes, which are broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a Master File Table copy. The boot sector holds the information that tells the BIOS the layout of the volume and the file system structure. The Master File Table holds the metadata for every file in the volume; the file system data area stores all data not included in the Master File Table; and the Master File Table copy is a duplicate of the MFT.[1] Having the copy ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also a part; every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, meaning it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk.  A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes.[2] NTFS also allows files to be compressed to save disk space, although this can hurt performance: to move compressed files they must first be decompressed, then transferred and recompressed. NTFS also has certain volume and size constraints.[3] It is a 64-bit file system, allowing for 2^64 bytes of storage, but it is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]&lt;br /&gt;
&lt;br /&gt;
====ext4====&lt;br /&gt;
The Fourth Extended File System, also known as ext4, is a Linux file system.   Like NTFS, ext4 uses volumes, but it does not use clusters.  It was designed to allow for greater scalability than ext3.  Ext4 uses extents, descriptors each representing a range of contiguous physical blocks.[4] Extents represent the data stored in the volume and allow better performance than ext3 when handling large files.  Ext4 is also a journaling file system: it records changes to be made in a journal before making them, in case there is an interruption while writing to the disk.  To help ensure data integrity, ext4 employs checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there.[4]  Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data.   Ext4 uses 48-bit physical block numbers rather than the 32-bit numbers used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4]&lt;br /&gt;
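The extent idea can be illustrated with a small block-mapping sketch; the triple format used here is a simplification for illustration, not the on-disk ext4 extent-tree layout:

```python
def logical_to_physical(extents, logical_block):
    # Each extent is (logical_start, physical_start, length): a single
    # descriptor covers an entire contiguous run of blocks, instead of
    # one mapping entry per block as in an ext3-style indirect map.
    for logical_start, physical_start, length in extents:
        offset = logical_block - logical_start
        if offset in range(length):
            return physical_start + offset
    raise KeyError("logical block %d is not mapped" % logical_block)

# A file stored as two contiguous runs: logical blocks 0-7 at physical
# 1000 onward, logical blocks 8-11 at physical 5000 onward. Two
# descriptors map twelve blocks.
extents = [(0, 1000, 8), (8, 5000, 4)]
```

Because one descriptor covers a whole run, a large sequentially laid-out file needs only a handful of extents, which is where the large-file performance benefit over per-block maps comes from.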
&lt;br /&gt;
====Comparison====&lt;br /&gt;
The most noticeable difference between ZFS and other current file systems is size.  NTFS allows a maximum volume of 256TB and ext4 allows 1EB, while ZFS allows a maximum file system of 16EB, 16 times more than the current ext4 Linux file system.  Given the amounts of storage available to the current file systems, ZFS is clearly better suited to servers.  ZFS also has the ability to self-heal, which neither of the two current file systems offers; this improves availability, as there is no need for downtime to scan the disk for errors.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Future File Systems&#039;&#039;&#039; ==&lt;br /&gt;
====BTRFS====&lt;br /&gt;
&lt;br /&gt;
Source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf (found through Google Scholar)&lt;br /&gt;
&lt;br /&gt;
BTRFS, the B-tree File System, is a file system often compared to ZFS because it offers very similar functionality, even though much of the implementation differs. BTRFS is built on the b-tree structure: a subvolume is a named b-tree made up of the files and directories stored within it.&lt;br /&gt;
&lt;br /&gt;
====WinFS====&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware &#039;08 Conference Companion  (Leuven, Belgium, December 01 - 05, 2008). Companion &#039;08. ACM, New York, NY, 12-17.&lt;br /&gt;
&lt;br /&gt;
* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 &amp;quot;Reducing the Storage Burden via Data Deduplication,&amp;quot;] Computer , vol.41, no.12, pp.15-17, Dec. 2008&lt;br /&gt;
&lt;br /&gt;
* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick&#039;s Blog. November 2, 2009.&lt;br /&gt;
&lt;br /&gt;
* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]&lt;br /&gt;
&lt;br /&gt;
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST&#039;10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.&lt;br /&gt;
&lt;br /&gt;
*2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec: 1.3.3&lt;br /&gt;
&lt;br /&gt;
*2.3b - Heybruck, W. F. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].&lt;br /&gt;
&lt;br /&gt;
*2.3c - Chen, R. (2008). Windows Confidential - A Brief and Incomplete History of FAT32. [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx]. &lt;br /&gt;
&lt;br /&gt;
*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].&lt;br /&gt;
&lt;br /&gt;
*2.3e - Brokken, F., &amp;amp; Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html]. &lt;br /&gt;
&lt;br /&gt;
*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].&lt;br /&gt;
&lt;br /&gt;
*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]&lt;br /&gt;
&lt;br /&gt;
*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]&lt;br /&gt;
&lt;br /&gt;
*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx&lt;br /&gt;
&lt;br /&gt;
*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx&lt;br /&gt;
&lt;br /&gt;
*[3] http://support.microsoft.com/kb/251186&lt;br /&gt;
&lt;br /&gt;
*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4312</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_9&amp;diff=4312"/>
		<updated>2010-10-15T01:38:47Z</updated>

		<summary type="html">&lt;p&gt;Tafatah: /* Who is doing what */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Contacts / If interested ==&lt;br /&gt;
Tawfic : tfatah@gmail.com&lt;br /&gt;
&lt;br /&gt;
Andy Zemancik: andy.zemancik@gmail.com&lt;br /&gt;
&lt;br /&gt;
Lester Mundt: lmundt@gmail.com&lt;br /&gt;
&lt;br /&gt;
Matthew Chou : mateh.cc@gmail.com (this is mchou2)&lt;br /&gt;
&lt;br /&gt;
Nisrin Abou-Seido: naseido@connect.carleton.ca&lt;br /&gt;
&lt;br /&gt;
== Suggested References Format ==&lt;br /&gt;
Author, publisher/university, Name of the article&lt;br /&gt;
&lt;br /&gt;
== Who is doing what ==&lt;br /&gt;
Suggestion: In order to avoid duplication. Please state what section/item you&#039;re currently working on.&lt;br /&gt;
&lt;br /&gt;
Tawfic : Currently working on Section One ZFS.&lt;br /&gt;
&lt;br /&gt;
Azemanci: Currently working on Section Three Current File Systems.&lt;br /&gt;
&lt;br /&gt;
Nisrin (naseido): Working on intro, conclusion and editing&lt;br /&gt;
&lt;br /&gt;
KEY ISSUE: We need a thesis statement. Please suggest ideas.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 01:38, 15 October 2010 (UTC) I think the thesis statement is implied in the current intro, i.e. &amp;quot;.. of avoiding some of the major problems associated&lt;br /&gt;
with traditional file systems . . &amp;quot; suggests that it&#039;s unacceptable anymore to tolerate problems in today&#039;s IT environment. If&lt;br /&gt;
you&#039;d like to expand on it, you could mention the continuous need for flexibility vis-a-vis using data. Example: there&#039;s a growing&lt;br /&gt;
trend with cloud computing (the marketable name for distributed computing). Users who will opt for that option will have to trust&lt;br /&gt;
the host companies with their data. It won&#039;t be acceptable to them to be told that a file here or a directory there was lost. The&lt;br /&gt;
issue with the growth of smart phones and yet to proliferate tablets also increases the demands for flexibility ( need to be able&lt;br /&gt;
to add/shrink/manage storage on the fly and avoid manual intervention when problems arise) . . etc. Hope that helps.&lt;br /&gt;
&lt;br /&gt;
The conclusion would have to assert the main points regarding ZFS, i.e. it&#039;s modularization and administrative simplicity, it&#039;s&lt;br /&gt;
virtualization of storage via the use of SPA and DMU, and it&#039;s self healing abilities. I am currently working on the section&lt;br /&gt;
that talks ( briefly ) about ZFS&#039;s self-healing ( for lack of better words ). So I&#039;ll be online for sometime. The info on SPA&lt;br /&gt;
amd DMU is already there. If it&#039;s not clear enough, please let me know&lt;br /&gt;
&lt;br /&gt;
== Deadline ==&lt;br /&gt;
Suggestion: Adding content should stop on Thursday, October 14&#039;th at 3:00 PM. Any work after that&lt;br /&gt;
should go into formatting, spelling, and grammar checking.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)&lt;br /&gt;
- I will definitely be adding content after this time probably late, late into the evening.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we&#039;d have to be done by 11:00Pm.&lt;br /&gt;
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 Am or so. So anytime&lt;br /&gt;
before that is Ok.&lt;br /&gt;
&lt;br /&gt;
== Essay Format Take 2 ==&lt;br /&gt;
Hello. I am suggesting the following format instead. If you agree, I&#039;ll take care of merging the existing info into this new format. My feeling is that this format is&lt;br /&gt;
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Abstract&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly&lt;br /&gt;
 growing needs of dynamic storage.&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer&lt;br /&gt;
 demand . . etc )&lt;br /&gt;
&lt;br /&gt;
 TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more&lt;br /&gt;
 intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Limitations&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Zettabyte File System&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Characteristics&#039;&#039;&#039;&lt;br /&gt;
** &#039;&#039;&#039;Dissected&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: List the seven components of ZFS and basically what makes a ZFS&lt;br /&gt;
 E.g. interface, various parts, and external needed libraries . . etc.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Features Beyond Traditional File Systems&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Possible Real-Life Scenarios / Examples&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: 2-3 examples where ZFS was/could/is being considered for use.&lt;br /&gt;
&lt;br /&gt;
 TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract&lt;br /&gt;
         thesis statement).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Alternatives to ZFS&#039;&#039;&#039;&lt;br /&gt;
 one example is good enough.&lt;br /&gt;
 TO-DO: a brief description of the alternative.&lt;br /&gt;
 Main argument for it’s viability.&lt;br /&gt;
&lt;br /&gt;
** &#039;&#039;&#039;Pros/Cons&#039;&#039;&#039;&lt;br /&gt;
 TO-DO: just a list of pluses and minuses&lt;br /&gt;
&lt;br /&gt;
 TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional&lt;br /&gt;
 FS’s are no longer viable, and  stressing once more that ZFS is a valid alternative.&lt;br /&gt;
&lt;br /&gt;
== Essay Format ==&lt;br /&gt;
&lt;br /&gt;
I started working on the main page.  The bullets are to be expanded. Other groups are working in their respective discussion pages but I think it&#039;s all right to put our work in progress on the front page.  Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)&lt;br /&gt;
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt;  what do you think of listing the capacities of the file system under major features?  I was thinking that we could overview the features in brief, then delve into each one individually.&lt;br /&gt;
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what you&#039;re suggesting in one section. So here is the structure I am thinking of.&lt;br /&gt;
&lt;br /&gt;
* Intro &lt;br /&gt;
* Section One ZFS&lt;br /&gt;
** Major feature 1&lt;br /&gt;
** Major feature 2&lt;br /&gt;
** Major feature 3 &lt;br /&gt;
* Section Two Legacy File Systems&lt;br /&gt;
** Legacy File System1( FAT32 ) - what it does&lt;br /&gt;
** Legacy File System2( ext2 ) - what it does&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Three Current File Systems&lt;br /&gt;
** NTFS?&lt;br /&gt;
** ext4?&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Section Four future file Systems&lt;br /&gt;
** BTRFS&lt;br /&gt;
** WinFS or ??&lt;br /&gt;
** Contrast them with ZFS&lt;br /&gt;
* Conclusion&lt;br /&gt;
&lt;br /&gt;
What does everyone think of this format?   While everyone should contribute to section one we could divvy up the rest.&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming it&#039;s own essay just on deduplication.&lt;br /&gt;
&lt;br /&gt;
The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.&lt;br /&gt;
&lt;br /&gt;
The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than an full end-to-end solution.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC)  The section on deduplication looks good I agree it looks like the right amount of coverage for a portion of a single section.  Your also right about the old file systems not being able to hold a candle to ZFS and the conclusion section should talk about how storage needs and computers changed.  And intro to that section could set the stage for that period as well.  Non-multi-threaded, single processor system with much smaller RAM, even the applications were radically different the Internet was just single webpages without the high performance needs of web commerce and online banking for example.  I have another assignment so won&#039;t be contributing too much until Monday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)&lt;br /&gt;
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)&lt;br /&gt;
I think I prefer the outline I proposed only because it&#039;s a very regimented contrast/compare essay format and should get us any marks for format.  Most proper essays don&#039;t usually have a dedicated pros cons list.  Heading more towards a report format I think.  It&#039;s really what everyone agrees on.  I won&#039;t be touching the essay until tomorrow though.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)&lt;br /&gt;
I like Lmundt&#039;s outline.  How would you like to divide up the work?  Also can everyone post the contact information so we know exactly who is in our group.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)&lt;br /&gt;
No problem, I&#039;ll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include&lt;br /&gt;
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that&lt;br /&gt;
if we keep these points in mind. I started looking into the &amp;quot;dissect subsection&amp;quot; in the format I suggested, which is related to the ZFS features&lt;br /&gt;
section one in the current format. I&#039;ll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I&#039;ll&lt;br /&gt;
take care of section one since I&#039;ve already done some work on it. I suggest that each member of the group picks two items from one of the other&lt;br /&gt;
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract&lt;br /&gt;
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the&lt;br /&gt;
different pieces.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)&lt;br /&gt;
I will begin working on section three current File Systems unless someone else has already begun working on it.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)&lt;br /&gt;
I am going to start researching for section 2.&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC)  Alright so all the sections are being taken care of so we should be good to go for Thursday.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) &#039;&#039;&#039;No one is assigned to section four&#039;&#039;&#039; ? Also, for those who haven&#039;t picked any section or subsection, please help out with the sections you&#039;re&lt;br /&gt;
more familiar with.&lt;br /&gt;
&lt;br /&gt;
Finally, if you were in class today (well, technically yesterday), then you&#039;ve heard Anil talk about plagiarism. I know this is common knowledge, so forgive&lt;br /&gt;
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is&lt;br /&gt;
simply impossible for the rest of the group to check whether every member&#039;s sentence is genuine or not. So use your own words/phrases ( doesn&#039;t&lt;br /&gt;
have to be fancy or sophisticated ). If you&#039;re not sure, please check with the rest of the group.&lt;br /&gt;
&lt;br /&gt;
Good luck, and good night.&lt;br /&gt;
--Tawfic&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC)  My bad I misread something I thought you were doing current file systems section 3.  I&#039;ll take section 3 but then someone needs to do section 4.  There are 4 of us so this should not be a problem.&lt;br /&gt;
&lt;br /&gt;
--[[User:Naseido|Naseido]] 13 October 2010  Sorry I haven&#039;t contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I&#039;ll be doing section 4.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)&lt;br /&gt;
I have done what I can for the legacy file systems; if someone who doesn&#039;t have a particular job could go over it and correct any errors they see, that would help. I am also not familiar with how to edit/format these wiki pages, so I tried my best; if you want to change the layout, please do. I assume that after we complete our sections and combine them into one essay, the formatting will change. I simply put headings on each section so it is easier to read.&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC)  I&#039;m not going to have my info posted by 3:00.  Also how and where are we supposed to cite our sources?&lt;br /&gt;
&lt;br /&gt;
--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have till 7:00 AM (or till Anil locks the wiki down, though I wouldn&#039;t count on more than 7:00) Friday, Oct 15. For citing, I am using&lt;br /&gt;
this convention: Bla.....Bla [Z1. P3] means I am using info from page 3 of the article labeled Z1 in the references section.&lt;br /&gt;
&lt;br /&gt;
--[[User:Lmundt|Lmundt]] 20:48, 14 October 2010 (UTC) Citation reference looks good. I think in each section we should talk about the motivation behind the design, with a small intro for each section. For example, in the legacy area (FAT32 and ext), we could talk about computing becoming commonplace for the average worker, so most development was for single users with a relatively small hard drive and not too many files. This in turn leads to a justification/explanation for the design decisions of each vintage of OS.&lt;br /&gt;
The main intro should be about the motivations behind the design: intended for servers, with a focus on expandability, reliability, and self-maintenance. This is the motivation behind all those cool features we are detailing.&lt;br /&gt;
&lt;br /&gt;
== Sources ==&lt;br /&gt;
&lt;br /&gt;
Not from your group. Found a file which goes to the heart of your problem&lt;br /&gt;
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-149902.pdf ZFSDatasheet]&lt;br /&gt;
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:  &lt;br /&gt;
* PDF:  http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf&lt;br /&gt;
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&amp;amp;hl=en&amp;amp;as_sdt=2000&lt;br /&gt;
&lt;br /&gt;
Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey, I&#039;m not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey there. I&#039;m not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations&lt;br /&gt;
&lt;br /&gt;
-- Munther&lt;br /&gt;
&lt;br /&gt;
--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS&lt;br /&gt;
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf&lt;br /&gt;
&lt;br /&gt;
--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)&lt;br /&gt;
A bit late, but I found a comparison of current file systems, including ZFS:&lt;br /&gt;
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf&lt;br /&gt;
&lt;br /&gt;
http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf&lt;/div&gt;</summary>
		<author><name>Tafatah</name></author>
	</entry>
</feed>