Talk:COMP 3000 Essay 2 2010 Question 1
Class and Notices
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning's class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won't be anyone there) -- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath
- I'm working today, but I'll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP
- I wont be there either. that does not mean i wont/cant contribute. I'll be on msn or you can just email me. -kirill
Group members
Patrick Young Rannath@gmail.com
Daniel Beimers demongyro@gmail.com
Andrew Bown abown2@connect.carleton.ca
Kirill Kashigin k.kashigin@gmail.com
Rovic Perdon rperdon@gmail.com
Daniel Sont dan.sont@gmail.com
Methodology
We should probably have our work verified by at least one group member before posting to the actual page
To Do
- Improve the grammar/structure of the paper section & add links to supplementary info
- Background Concepts -fill in info (fii)
- Contribution -fii
- Critique -fii
- References -fii
Claim Sections
- I claim Exim and memcached for background and critique -Rannath
- also per-core data structures, false sharing and unessesary locking for contribution -Rannath
- For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - Daniel B.
- Also, I will take sloppy counters as well - Daniel B.
- I'm gonna put some work into the apache and postgresql sections - kirill
- Just as a note Anil in class Thuesday the 30th of November said that we only need to explain 3 of the applications and not all 7 - Andrew
- I'll do the Research problem and contribution sections. - Andrew
- I will work on contribution - Rovic
- I'm gonna whip something up for 4.2 since there appears to be nothing mentioned about it. -kirill
- So here is the claims and unclaimed section. Add your name next to one if you want to take it on.
- gmake - Daniel B.
- memcached - Rannath
- Apache - Kirill
- (Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)
- Research Problem - Andrew
- Contribution - Rovic
- Critic, Style - Everyone
- References - Everyone
- Essay Conclusion (also discussion) - Rannath, but I need someone to help flesh it out, I got the salient points down.
Essay
Paper
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.
They all work at MIT CSAIL.
The paper: An Analysis of Linux Scalability to Many Cores
Background Concepts
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.
Ideas to explain: - thread (maybe) - Linux's move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples) - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable - Describe the programs tested (what they do, how they're programmed (serial vs parallel), where to the do their processing)
memchached: Section 3.2
memcached is an in-memory hash table server. One instance running on many cores is bottlenecked by an internal lock. The MIT team ran one instance per-core to avoid the problem. Clients each connect to a single instance. This allows the server to simulate parallelism. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.
Apache: Section 3.3
Apache is a web server. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (Making it a perfect example of parallel programming). One thread to service incoming connections and various other threads to service those connections. On a single core processor, Apache spends 60% of its execution time in the kernel.
gmake: Section 3.5
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake is already quite parallel, creating more processes than cores, so that it can make proper use of multiple cores, and involves much reading and writing of files, as it is used to build the Linux kernel. gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution. gmake spends much of its execution time with its compiler, but still spend 7.6% of its time in system time. [1]
Research problem
my references are just below because it is easier for numbering the data later.
As technological progress the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system[1]. The problem with a standard Linux operating is they are not designed for massive scalability which will soon be a problem. The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic that situation makes sense because 48 cores are dividing the work. But when processing information a process the main goal is to finish so as long as possible every core should be doing a much work as possible.
To fix those scalability issues it is necessary to focus on three major areas: the Linux kernel, user level design and how application use of kernel services. The Linux kernel can be improved be to improve sharing and have the advantage of recent iterations are beginning to implement scalability features. On the user level design applications can be improved so that there is more focus on parallelism since some programs have not implements those improved features. The final aspect of improving scalability is how an application uses kernel services to share resources better so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found actually only take a little work to avoid.[1]
This research is based on much research which was created before in the development of scalability for UNIX system. The major developments from shared memory machines [2], wait-free synchronization to fast message passing have created a base set of techniques which can be used to improve scalability. These techniques have been incorporated in all major operation system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update which an algorithm for which is used to avoid locks and atomic instructions which lower scalability.[3] The is also an excellent base a research on Linux scalability studies to base this research paper. These paper include a on doing scalability on a 32-core machine. [4] That research can improve the results by learning from the experiments already performed by researchers. This research also aid identifying bottlenecks which speed up researching solutions for those bottlenecks.
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update. In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009
Contribution
What was implemented? Why is it any better than what came before?
Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn't unethical/illegal)? - So long as we cite the paper and don't pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.
-I'm just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP
-==Work in Progress==-- -Rovic P. This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.
One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.
Section 4.1 problems:
- The percentage of serialization in a program has a lot to do with how much an application can be sped up. As from the example in the paper, it seems to follow Amdahl's law (e.g. 25% serialization --> limit of 4x speedup).
- Types of serializing interactions found in the MOSBENCH apps:
- Locking of shared data structure as the number of cores increase leads to an increase in lock wait time
- Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol
- Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate
- Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources
- Not enough tasks for cores leads to idle cores
Multicore packet processing: Section 4.2
Sloppy counters: Section 4.3
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.
Lock-free comparison: Section 4.4
This section describes a specific instance of unnecessary locking.
Per-Core Data Structures: Section 4.5
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.
Eliminating false sharing: Section 4.6
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.
Avoiding unnecessary locking: Section 4.7
Many locks/mutexes have special cases where they don't need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.
Critique
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.
Since this is a "my implementation is better then your implementation" paper the "goodness" of content can be impartially determined by its fairness and the honesty of the authors.
Content(Fairness): Section 5
memcached: Section 5.3
memcached is treated with near perfect fairness in the paper. Its an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the "stock" and "PK" implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per-core to keep up throughput. The FAQ at memcached.org's wiki suggests using multiple implementations per-server as a work around to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [1]. In the end memcached was bottlenecked by the network card.
[1] memcached's wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache
Apache: Section 5.4
Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel implementation of the network stack is also specific to the problem at hand, which is processing multiple short lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware. Which is not a problem as the paper specifically states that there are hardware limitations.
gmake: Section 5.6
Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernel. The only change that was found was that gmake spent slightly less time at the system level because of the changes that were made to the system's caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is uses with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seems to be no fairness concerns when it comes to the scalability testing of gmake as the same application load-out was used for all of the tests.
Style
Style Criterion (feel free to add I have no idea what should go here): - does the paper present information out of order? - does the paper present needless information? - does the paper have any sections that are inherently confusing? Wrong? or use bad methodology? - is the paper easy to read through, or does it change subjects repeatedly? - does the paper have too many "long-winded" sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences. - Check for grammar
Conclusion: Sections 6 & 7
Everything so far indicates that MOSBENCH application can scale to 48 cores. This scaling required a few modest changes to remove bottlenecks. The MIT team speculate that that trend will continue as the number of cores increase. They also state that things not bottlenecked by the CPU are harder to fix.
We can eliminate most kernel bottlenecks that the applications hits most often with minor changes. Most changes were well known methodology, with the exception of Sloppy counters. This study is limited by the removal of the IO bottleneck, but it does suggest that traditional implementations can be made scalable.
References
[1] Silas Boyd-Wickizer et al. "An Analysis of Linux Scalability to Many Cores". In OSDI '10, 9th USENIX Symposium on OS Design and Implementation, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.
gmake:
Deprecated
Background Concepts
- Exim: Section 3.1:
- Exim is a mail server for Unix. It's fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.
- PostgreSQL: Section 3.4:
- As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.