Talk:COMP 3000 Essay 2 2010 Question 1: Difference between revisions

From Soma-notes
Rannath (talk | contribs)
Rannath (talk | contribs)
 
(330 intermediate revisions by 5 users not shown)
Line 1: Line 1:
=Class and Notices=
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.
=Group members=
=Group members=
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]
Line 9: Line 12:


Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]


=Methodology=
=Methodology=
We should probably have our work verified by at least one group member before posting to the actual page
We should probably have our work verified by at least one group member before posting to the actual page
=To Do=
*Improve the grammar/structure of the paper section
*Background Concepts -fill in info (fii)
*Research problem -fii
*Contribution -fii
*Critique -fii
*References -fii
===Claim Sections===
* I claim Exim and memcached for background and critique -[[Rannath]]
* also per-core data structures, false sharing and unessesary locking for contribution -[[Rannath]]


=Essay=
=Essay=
===Paper===
==Paper - DONE!!!==
The paper's title, authors, and their affiliations. Include a link to the paper and any particularly helpful supplementary information.
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.
 
Authors in order presented: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich
 
affiliation: MIT CSAIL
 
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]
 
===Background Concepts===
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.
 
Ideas to explain:
#thread (maybe)
#Linux's move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)
#Summarize scalability tutorial (Section 4.1 of the paper)
#Describe the programs tested (what they do, how they're programmed (serial vs parallel), where to the do their processing)
 
=====Exim: ''Section 3.1''=====
Exim is a mail server for Unix. It's fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.
 
=====memchached: ''Section 3.2''=====
memcached is an in-memory hash table. memchached is very much not parallel, but can be made to be, just run multiple instances. Have clients worry about synchronizing data between the different instances. With few requests memcached does most of its processing at the network stack, 80% of its time on one core.


=====Apache: ''Section 3.3''=====
They all work at MIT CSAIL.


=====PostgreSQL: ''Section 3.4''=====
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]


=====gmake: ''Section 3.5''=====
==Background Concepts - DONE!!!==
===memcached: ''Section 3.2''===
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]


=====Psearchy: ''Section 3.6''=====
===Apache: ''Section 3.3''===
Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]


=====Metis: ''Section 3.7''=====
===gmake: ''Section 3.5''===
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]


===Research problem===
[2] http://www.gnu.org/software/make/manual/make.html
What is the research problem being addressed by the paper? How does this problem relate to past related work?


Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)
==Research problem - DONE!!!==
As technology progresses, the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system<sup>[[#Foot1|1]]</sup>. The problem with a standard Linux OS is that they are not designed for massive scalability, which will soon prove to be a problem.  The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic states that the situation makes sense because there are 48 cores dividing the work, the information should be processed as fast as possible with each core doing as much work as possible.


Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user level design and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and use the current advantages of recent improvement to scalability features. On the user level design, applications can be improved so that there is more focus on parallelism since some programs have not implemented those improved features. The final aspect of improving scalability is how an application uses kernel services to better share resources so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found easily and actually only take simple changes to correct or avoid.<sup>[[#Foot1|1]]</sup>


===Contribution===
This research uses a foundation of previous research discovered during the development of scalability in UNIX systems. The major developments from shared memory machines<sup>[[#Foot2|2]]</sup> and wait-free synchronization to fast message passing ended up creating a base set of techniques, which can be used to improve scalability. These techniques have been incorporated in all major operating system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems, such as Read-Copy-Update, which is an algorithm that is used to avoid locks and atomic instructions which affect scalability.<sup>[[#Foot3|3]]</sup> There is an excellent base of research on Linux scalability studies that have already been written, on which this research paper can model its testing standards. These papers include research on improving scalability on a 32-core machine.<sup>[[#Foot4|4]]</sup> In addition, the base of studies can be used to improve the results of these experiments by learning from the previous results. This research may also aid in identifying bottlenecks which speed up creating solutions for those problems.
What was implemented? Why is it any better than what came before?


Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn't unethical/illegal)?
==Contribution - DONE!!!==
===MOSBENCH & 16 scalability improvements===
MOSBENCH is a set of application available through MIT. They are designed to measure scalability: "It consists of applications that previous work has shown not to scale well on Linux and applications that are designed for parallel execution and are kernel intensive."<sup>[[#Foot6|6]]</sup> Overall, there were 16 scalability problems encountered by MOSBENCH applications within the scope of the paper and each problem was fixed. The fixes add 2617 lines of code and remove 385 lines of code from Linux.<sup>[[#Foot1|1]]</sup>


Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.
===Techniques to improve scalability===


=====Multicore packet processing: ''Section 4.2''=====
====What hinders scalability: ''Section 4.1''====
Amdahl's Law states that a parallel program can only be sped up by the inverse of the portion of the program that cannot be made parallel, so, for example, if a program is 50% non-parallel, then at most the program can be sped up to twice the speed using parallelism. So the more serial a program is, the less capability it has for scalability. The main problems found within the MOSBENCH applications that cause serialized interactions were locking of shared data structures, writing to shared memory, competing for space in shared hardware caches, competing for shared hardware resources, and the lack of tasks leading to idle cores. These problems become more evident as more cores are added to the system. The team behind the paper came up with some solutions that either fixed these problems or avoided most, if not all, of the bottlenecks that occurred.<sup>[[#Foot1|1]]</sup>


=====Sloppy counters: ''Section 4.3''=====
====Multicore packet processing: ''Section 4.2''====
Linux packet processing technique requires the packets to travel along several queues before it finally becomes available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering<sup>[[#Foot8|8]]</sup> to direct packet flow onto different cores for processing. Or even go as far as directing packet flow to the core on which the application is running using Receive Flow Steering<sup>[[#Foot9|9]]</sup> for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short lived connections like those associated with Apache since there is great potential for packets to be misdirected.


=====Lock-free comparison: ''Section 4.4''=====
In general, this technique performs poorly when there are numerous open connections spread across multiple cores due to mutex (mutual exclusion) delays and cache misses. In such scenarios, it's better to process all connections, with associated packets and queues, on one core to avoid said issues. The patched kernel's implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Sharing) to direct all packets from a given connection to the same core. In turn, Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core's queue is found to be empty, it will attempt to obtain work from queues located on different cores.<sup>[[#Foot1|1]]</sup> This configuration is ideal for numerous short connections as all the work for them in accomplished quickly on one core avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.


=====Per-Core Data Structures: ''Section 4.5''=====
====Sloppy counters: ''Section 4.3''====
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.<sup>[[#Foot1|1]]</sup>


=====Eliminating false sharing: ''Section 4.6''=====
====Lock-free comparison & Avoiding unnecessary locking: ''Section 4.4 & 4.7''====
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means there is reduced performance in returning information pertaining to a specific file path when there are multiple threads trying to access files in common parent directories due to the kernel serializing the process. This problem is solved in the patched kernel by introducing a new counter to keep track of threads actively looking at the directory entry cache. If a certain thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other the situation is rightfully ignored allowing for much faster access to different different entries in the directory entry cache.<sup>[[#Foot1|1]]</sup>


=====Avoiding unnecessary locking: ''Section 4.7''=====
There are many other locks/mutexes that have special cases where they don't need to lock. Others can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.<sup>[[#Foot1|1]]</sup>
Many locks/mutexes have special cases where they don't need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.


===Critique===
====Per-Core Data Structures: ''Section 4.5''====
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.
Three centralized data structures were causing bottlenecks due to lock contention - a per-superblock list of open files, vfsmount table, and the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.<sup>[[#Foot1|1]]</sup>


Since this is a "my implementation is better then your implementation" paper the "goodness" of content can be impartially determined by its fairness and the honesty of the authors.
====Eliminating false sharing: ''Section 4.6''====
Misplaced variables on the cache cause different cores to request the same line, but using different variables. With two different cores requesting the same variable, with one reading and the other writing, there was a severe bottleneck. By moving the often written variable to another line the bottleneck was removed.<sup>[[#Foot1|1]]</sup>


Fairness criterion:
===Conclusion===
#does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)
====memcached: ''Section 5.3''====
#does the test put all tested implementations through the same test? (internal fairness)
memcached nearly doubles throughput at 48 cores on the paper's implementation over the stock implementation. After improvements to the Linux kernel memcached is limited by hardware. Improvements to hardware scalability, virtual queue handling in this case, will allow further improvements to memcached.<sup>[[#Foot1|1]]</sup>


Style Criterion (feel free to add I have no idea what should go here):
====Apache: ''Section 5.4''====
#does the paper present information out of order?
Apache keeps fairly a equal amount of throughput up to 36 cores. Then it slopes downwards. At 48 cores it still has an improved throughput of more than 12 times. Apache like memcached is limited by hardware. At higher core count the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.<sup>[[#Foot1|1]]</sup>
#does the paper present needless information?
#does the paper have any sections that are inherently confusing?


=====Fairness: ''Section 5''=====
====gmake: ''Section 5.6''====
Both the stock and new implementations use the same benchmarks, therefore internal fairness is preserved for all seven programs.
gmakes run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code and some processes that finish somewhat later then all others, which prevents perfect scalability. Even then it achieves the greatest level of scalability out of the three programs (35X speed up for 48 cores.)<sup>[[#Foot1|1]]</sup>


=======Exim: ''Section 5.2''=======
====No reason to give up traditional kernel design====
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - "as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance."
The contribution of this paper is a lot of research that has focus upon techniques and methods for scalability. This is accomplished through programming of applications alongside kernel programming. This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. In looking at the issue of scalability it is important to note the factors which hinder scalability.  


This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.
It has been shown that simple scaling techniques can be effective in increasing scalability. The authors looked at three different approaches to removing the bottlenecks within the system. The first was to see if there were issues within the linux kernel application, the second was to identify issues with the application design and the third was to address how the application interacts with the linux kernel services. Through this approach, the authors were able to quickly identify problems, such as bottlenecks, and apply simple techniques in fixing the issues at hand to reap some beneficial aspects. Some of the sections listed provide insight into the improvements that can be reaped from these optimizations.  


=======memcached: ''Section 5.3''=======
Through this research on various techniques, as listed above, it was determined by the authors that the Linux kernel itself has many incorporated techniques used to improve scalability. The authors actually go on to speculate that "perhaps it is the case that Linux's system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects."<sup>[[#Foot1|1]]</sup> This tends to show that the work in the Linux community has improved Linux a large amount and is current with modern techniques for optimization.  
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.


=======Apache: ''Section 5.4''=======
The kernel design does not seem to affect the scalability of the system. While kernel changes do help, they are not enough; the applications must also be scalable. Finally, there must also be improvements to currently non-scalable hardware.


=======PostgreSQL: ''Section 5.5''=======
==Critique - DONE!!!==
Aside from a few acronyms that were unexplained (e.g. Linux TLB) the paper has no real stylistic problems.


=======gmake: ''Section 5.6''=======
===memcached: ''Section 5.3''===
memcached is treated with near perfect fairness in the paper. Its an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the "stock" and "PK" implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per-core to keep up throughput. The FAQ at memcached.org's wiki suggests using multiple implementations per-server as a work around to another problem, which implies that there is no great problem with running multiple servers on one machine.<sup>[[#Foot7|7]]</sup>


=======Psearchy: ''Section 5.7''=======
===Apache: ''Section 5.4''===
Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel implementation of the network stack is also specific to the problem at hand, which is processing multiple short lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware.<sup>[[#Foot1|1]]</sup>


=======Metis: ''Section 5.8''=======
===gmake: ''Section 5.6''===
Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernel. The only change that was found was that gmake spent slightly less time at the system level because of the changes that were made to the system's caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is uses with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seems to be no fairness concerns when it comes to the scalability testing of gmake as the same application load-out was used for all of the tests.<sup>[[#Foot1|1]]</sup>


===References===
===Conclusion: ''Sections 6 & 7''===
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.
Each testing case has the same specific parameters, with the only changes being from the stock kernel to the PK kernel. There was one assumption that held true for all tests, the I/O bottleneck was irrelevant to the tests. Thus the conclusion has no concerns with respect to the content.

Latest revision as of 04:33, 3 December 2010

Class and Notices

(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.

Group members

Patrick Young Rannath@gmail.com

Daniel Beimers demongyro@gmail.com

Andrew Bown abown2@connect.carleton.ca

Kirill Kashigin k.kashigin@gmail.com

Rovic Perdon rperdon@gmail.com

Daniel Sont dan.sont@gmail.com

Methodology

We should probably have our work verified by at least one group member before posting to the actual page

Essay

Paper - DONE!!!

This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.

They all work at MIT CSAIL.

The paper: An Analysis of Linux Scalability to Many Cores

Background Concepts - DONE!!!

memcached: Section 3.2

memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]

Apache: Section 3.3

Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]

gmake: Section 3.5

gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]

[2] http://www.gnu.org/software/make/manual/make.html

Research problem - DONE!!!

As technology progresses, the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system1. The problem with a standard Linux OS is that they are not designed for massive scalability, which will soon prove to be a problem. The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic states that the situation makes sense because there are 48 cores dividing the work, the information should be processed as fast as possible with each core doing as much work as possible.

To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user level design and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and use the current advantages of recent improvement to scalability features. On the user level design, applications can be improved so that there is more focus on parallelism since some programs have not implemented those improved features. The final aspect of improving scalability is how an application uses kernel services to better share resources so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found easily and actually only take simple changes to correct or avoid.1

This research uses a foundation of previous research discovered during the development of scalability in UNIX systems. The major developments from shared memory machines2 and wait-free synchronization to fast message passing ended up creating a base set of techniques, which can be used to improve scalability. These techniques have been incorporated in all major operating system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems, such as Read-Copy-Update, which is an algorithm that is used to avoid locks and atomic instructions which affect scalability.3 There is an excellent base of research on Linux scalability studies that have already been written, on which this research paper can model its testing standards. These papers include research on improving scalability on a 32-core machine.4 In addition, the base of studies can be used to improve the results of these experiments by learning from the previous results. This research may also aid in identifying bottlenecks which speed up creating solutions for those problems.

Contribution - DONE!!!

MOSBENCH & 16 scalability improvements

MOSBENCH is a set of application available through MIT. They are designed to measure scalability: "It consists of applications that previous work has shown not to scale well on Linux and applications that are designed for parallel execution and are kernel intensive."6 Overall, there were 16 scalability problems encountered by MOSBENCH applications within the scope of the paper and each problem was fixed. The fixes add 2617 lines of code and remove 385 lines of code from Linux.1

Techniques to improve scalability

What hinders scalability: Section 4.1

Amdahl's Law states that a parallel program can only be sped up by the inverse of the portion of the program that cannot be made parallel, so, for example, if a program is 50% non-parallel, then at most the program can be sped up to twice the speed using parallelism. So the more serial a program is, the less capability it has for scalability. The main problems found within the MOSBENCH applications that cause serialized interactions were locking of shared data structures, writing to shared memory, competing for space in shared hardware caches, competing for shared hardware resources, and the lack of tasks leading to idle cores. These problems become more evident as more cores are added to the system. The team behind the paper came up with some solutions that either fixed these problems or avoided most, if not all, of the bottlenecks that occurred.1

Multicore packet processing: Section 4.2

Linux packet processing technique requires the packets to travel along several queues before it finally becomes available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering8 to direct packet flow onto different cores for processing. Or even go as far as directing packet flow to the core on which the application is running using Receive Flow Steering9 for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short lived connections like those associated with Apache since there is great potential for packets to be misdirected.

In general, this technique performs poorly when there are numerous open connections spread across multiple cores due to mutex (mutual exclusion) delays and cache misses. In such scenarios, it's better to process all connections, with associated packets and queues, on one core to avoid said issues. The patched kernel's implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Sharing) to direct all packets from a given connection to the same core. In turn, Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core's queue is found to be empty, it will attempt to obtain work from queues located on different cores.1 This configuration is ideal for numerous short connections as all the work for them in accomplished quickly on one core avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.

Sloppy counters: Section 4.3

Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.1

Lock-free comparison & Avoiding unnecessary locking: Section 4.4 & 4.7

The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means there is reduced performance in returning information pertaining to a specific file path when there are multiple threads trying to access files in common parent directories due to the kernel serializing the process. This problem is solved in the patched kernel by introducing a new counter to keep track of threads actively looking at the directory entry cache. If a certain thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other the situation is rightfully ignored allowing for much faster access to different different entries in the directory entry cache.1

There are many other locks/mutexes that have special cases where they don't need to lock. Others can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.1

Per-Core Data Structures: Section 4.5

Three centralized data structures were causing bottlenecks due to lock contention - a per-superblock list of open files, vfsmount table, and the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.1

Eliminating false sharing: Section 4.6

Misplaced variables on the cache cause different cores to request the same line, but using different variables. With two different cores requesting the same variable, with one reading and the other writing, there was a severe bottleneck. By moving the often written variable to another line the bottleneck was removed.1

Conclusion

memcached: Section 5.3

memcached nearly doubles throughput at 48 cores on the paper's implementation over the stock implementation. After improvements to the Linux kernel memcached is limited by hardware. Improvements to hardware scalability, virtual queue handling in this case, will allow further improvements to memcached.1

Apache: Section 5.4

Apache keeps fairly a equal amount of throughput up to 36 cores. Then it slopes downwards. At 48 cores it still has an improved throughput of more than 12 times. Apache like memcached is limited by hardware. At higher core count the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.1

gmake: Section 5.6

gmakes run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code and some processes that finish somewhat later then all others, which prevents perfect scalability. Even then it achieves the greatest level of scalability out of the three programs (35X speed up for 48 cores.)1

No reason to give up traditional kernel design

The contribution of this paper is a lot of research that has focus upon techniques and methods for scalability. This is accomplished through programming of applications alongside kernel programming. This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. In looking at the issue of scalability it is important to note the factors which hinder scalability.

It has been shown that simple scaling techniques can be effective in increasing scalability. The authors looked at three different approaches to removing the bottlenecks within the system. The first was to see if there were issues within the linux kernel application, the second was to identify issues with the application design and the third was to address how the application interacts with the linux kernel services. Through this approach, the authors were able to quickly identify problems, such as bottlenecks, and apply simple techniques in fixing the issues at hand to reap some beneficial aspects. Some of the sections listed provide insight into the improvements that can be reaped from these optimizations.

Through this research on various techniques, as listed above, it was determined by the authors that the Linux kernel itself has many incorporated techniques used to improve scalability. The authors actually go on to speculate that "perhaps it is the case that Linux's system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects."1 This tends to show that the work in the Linux community has improved Linux a large amount and is current with modern techniques for optimization.

The kernel design does not seem to affect the scalability of the system. While kernel changes do help, they are not enough; the applications must also be scalable. Finally, there must also be improvements to currently non-scalable hardware.

Critique - DONE!!!

Aside from a few acronyms that were unexplained (e.g. Linux TLB) the paper has no real stylistic problems.

memcached: Section 5.3

memcached is treated with near perfect fairness in the paper. Its an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the "stock" and "PK" implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per-core to keep up throughput. The FAQ at memcached.org's wiki suggests using multiple implementations per-server as a work around to another problem, which implies that there is no great problem with running multiple servers on one machine.7

Apache: Section 5.4

Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel implementation of the network stack is also specific to the problem at hand, which is processing multiple short lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware.1

gmake: Section 5.6

Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernel. The only change that was found was that gmake spent slightly less time at the system level because of the changes that were made to the system's caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is uses with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seems to be no fairness concerns when it comes to the scalability testing of gmake as the same application load-out was used for all of the tests.1

Conclusion: Sections 6 & 7

Each testing case has the same specific parameters, with the only changes being from the stock kernel to the PK kernel. There was one assumption that held true for all tests, the I/O bottleneck was irrelevant to the tests. Thus the conclusion has no concerns with respect to the content.