Class and Notices

(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.

- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning's class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.

- I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions? - Daniel B.

Group members

Patrick Young Rannath@gmail.com

Daniel Beimers demongyro@gmail.com

Andrew Bown abown2@connect.carleton.ca

Kirill Kashigin k.kashigin@gmail.com

Rovic Perdon rperdon@gmail.com

Daniel Sont dan.sont@gmail.com

Methodology

We should probably have our work verified by at least one group member before posting to the actual page

To Do

Improve the grammar/structure of the paper section & add links to supplementary info
Background Concepts -fill in info (fii)
Research problem -fii
Contribution -fii
Critique -fii
References -fii

Claim Sections

I claim Exim and memcached for background and critique -Rannath
also per-core data structures, false sharing and unessesary locking for contribution -Rannath
For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - Daniel B.
Also, I will take sloppy counters as well - Daniel B.
I'm gonna put some work into the apache and postgresql sections - kirill
Just as a note Anil in class Thuesday the 30th of November said that we only need to explain 3 of the applications and not all 7 - Andrew
I'll do the Research problem and contribution sections. - Andrew
I will work on contribution - Rovic

Essay

Paper

This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.

They all work at MIT CSAIL.

The paper: An Analysis of Linux Scalability to Many Cores

Background Concepts

Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

Ideas to explain:
- thread (maybe)
- Linux's move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)
- Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable
- Describe the programs tested (what they do, how they're programmed (serial vs parallel), where to the do their processing)

Exim: Section 3.1

Exim is a mail server for Unix. It's fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.

memchached: Section 3.2

memcached is an in-memory hash table. memchached is very much not parallel, but can be made to be, just run multiple instances. Have clients worry about synchronizing data between the different instances. With few requests memcached does most of its processing at the network stack, 80% of its time on one core.

Apache: Section 3.3

Apache is a web server. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (Making it a perfect example of parallel programming). One thread to service incoming connections and various other threads to service those connections. On a single core processor, Apache spends 60% of its execution time in the kernel.

PostgreSQL: Section 3.4

As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.

gmake: Section 3.5

Psearchy: Section 3.6

Metis: Section 3.7

Research problem

What is the research problem being addressed by the paper? How does this problem relate to past related work?

Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)

Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)

Section 4.1 problems:
- The percentage of serialization in a program has a lot to do with how much an application can be sped up. As from the example in the paper, it seems to be an inverse relationship (e.g. 25% serialization --> limit of 4x speedup).
- Types of serializing interactions found in the MOSBENCH apps:
  - Locking of shared data structure - increasing # of cores --> increase in lock wait time
  - Writing to shared memory - increasing # of cores --> increase in wait for cache coherence protocol
  - Competing for space in shared hardware cache - increasing # of cores --> increase in cache miss rate
  - Competing for shared hardware resources - increasing # of cores --> increase in wait for resources
  - Not enough tasks for cores --> idle cores

Contribution

What was implemented? Why is it any better than what came before?

Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn't unethical/illegal)?
  - So long as we cite the paper and don't pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.

Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.

Multicore packet processing: Section 4.2

Sloppy counters: Section 4.3

Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.

Lock-free comparison: Section 4.4

Per-Core Data Structures: Section 4.5

Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.

Eliminating false sharing: Section 4.6

Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.

Avoiding unnecessary locking: Section 4.7

Many locks/mutexes have special cases where they don't need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.

Critique

What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

Since this is a "my implementation is better then your implementation" paper the "goodness" of content can be impartially determined by its fairness and the honesty of the authors.

Content(Fairness): Section 5

Fairness criterion:
- does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)
- does the test put all tested implementations through the same test? (internal fairness)

Both the stock and new implementations use the same benchmarks, therefore neither of them has a particular advantage. That holds true for all seven programs.

Exim: Section 5.2

The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - "as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance."

This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.

memcached: Section 5.3

memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.

Apache: Section 5.4

Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware.

PostgreSQL: Section 5.5

gmake: Section 5.6

Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernel. The only change that was found was that gmake spent slightly less time at the system level because of the changes that were made to the system's caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is used with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seems to be no fairness concerns when it comes to the scalability testing of gmake.

Psearchy: Section 5.7

Metis: Section 5.8

Style

Style Criterion (feel free to add I have no idea what should go here):
- does the paper present information out of order?
- does the paper present needless information?
- does the paper have any sections that are inherently confusing?
- is the paper easy to read through, or does it change subjects repeatedly?
- does the paper have too many "long-winded" sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.
- Check for grammar

References

You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.