<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kkashigi</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kkashigi"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Kkashigi"/>
	<updated>2026-05-03T15:00:16Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6712</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6712"/>
		<updated>2010-12-03T03:42:12Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Conclusion: Sections 6 &amp;amp; 7 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*The only thing we need to do is determine how fair the conclusion is.&lt;br /&gt;
&lt;br /&gt;
*Thank God [[SIR DANIEL SONT OF OTTAWA, pay attention here]] I need you to take a look at the conclusion and help determine the &amp;quot;fairness&amp;quot; of it.&lt;br /&gt;
&lt;br /&gt;
*We also need to make sure everything is moved over to the actual essay page, and once that is done, that all the references are done correctly.&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the others process them. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves much reading and writing, the test cases use the in-memory filesystem tmpfs to sidestep bottlenecks caused by the filesystem or storage hardware. gmake is also limited in scalability, to a small degree, by the serial processes that run at the beginning and end of its execution. gmake spends much of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of cores a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way for the standard Linux kernel to scale to a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem with a standard Linux OS is that it is not designed for massive scalability, which will soon prove to be a problem. The issue is that a core working alone performs much more work than a single core working alongside 47 other cores. Although this may seem acceptable because there are 48 cores dividing the work, ideally each core should keep doing as much work as possible so that the workload is processed as fast as possible.&lt;br /&gt;
&lt;br /&gt;
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user-level design, and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and by taking advantage of recent improvements to its scalability features. At the user level, applications can be improved to focus more on parallelism, since some programs do not yet take advantage of those features. The final aspect of improving scalability is how an application uses kernel services, so that different parts of the program are not conflicting over the same services. All of the bottlenecks are easily found and only require simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research builds on a foundation of previous work on scalability in UNIX systems. The major developments, from shared memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing, ended up creating a base set of techniques which can be used to improve scalability. These techniques have been incorporated in all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update, an algorithm used to avoid locks and atomic instructions which hurt scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is an excellent base of existing Linux scalability studies on which this paper can model its testing standards, including research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; In addition, that base of studies can be used to improve the results of these experiments by learning from previous results, and it may also aid in identifying bottlenecks, which speeds up creating solutions for those problems.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
===16 scalability improvements===&lt;br /&gt;
There were 16 scalability problems encountered by MOSBENCH applications within the scope of the paper. Each was fixed. The fixes add 2617 lines of code and remove 385 lines of code from Linux.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===MOSBENCH===&lt;br /&gt;
MOSBENCH is a set of applications available through MIT. They are designed to measure scalability: &amp;quot;It consists of applications that previous work has shown not to scale well on Linux and applications that are designed for parallel execution and are kernel intensive.&amp;quot;&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Techniques to improve scalability===&lt;br /&gt;
&lt;br /&gt;
====What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;====&lt;br /&gt;
Amdahl&#039;s Law states that a parallel program can be sped up by at most the inverse of the portion of the program that cannot be made parallel; for example, if a program is 50% non-parallel, then at most it can be sped up to twice the speed using parallelism. So the more serial a program is, the less capability it has for scalability. The main problems found within the MOSBENCH applications that cause serialized interactions were locking of shared data structures, writing to shared memory, competing for space in shared hardware caches, competing for shared hardware resources, and a lack of tasks leading to idle cores. These problems become more evident as more cores are added to the system. The team behind the paper came up with solutions that either fixed these problems or avoided most, if not all, of the bottlenecking that occurred.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
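&lt;br /&gt;
The limit imposed by the serial fraction can be illustrated with a small calculation. Below is a minimal sketch in C (our own illustration, not code from the paper), where s is the fraction of the program that cannot be made parallel and n is the number of cores:&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Amdahl speedup(n) = 1 / (s + (1 - s) / n); s is the serial fraction. */&lt;br /&gt;
 static double amdahl_speedup(double s, int n)&lt;br /&gt;
 {&lt;br /&gt;
     return 1.0 / (s + (1.0 - s) / n);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     /* With s = 0.5 the speedup is capped at 2x no matter how many cores. */&lt;br /&gt;
     printf(&amp;quot;48 cores, 50%% serial: %.2fx\n&amp;quot;, amdahl_speedup(0.50, 48));&lt;br /&gt;
     printf(&amp;quot;48 cores,  5%% serial: %.2fx\n&amp;quot;, amdahl_speedup(0.05, 48));&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;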
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
Linux&#039;s packet processing path requires packets to travel along several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and it can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for them is accomplished quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
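&lt;br /&gt;
As a rough illustration of the connection-affinity idea above, here is a small sketch in C (our own invention; the function and parameter names are made up, this is not kernel code):&lt;br /&gt;
 #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Hypothetical flow-steering sketch: every packet of a given connection&lt;br /&gt;
    hashes to the same core, so one core handles the whole connection and&lt;br /&gt;
    cross-core lock contention and cache misses are avoided. */&lt;br /&gt;
 static unsigned pick_core(uint32_t saddr, uint32_t daddr,&lt;br /&gt;
                           uint32_t sport, uint32_t dport, unsigned ncores)&lt;br /&gt;
 {&lt;br /&gt;
     uint32_t h = (saddr * 31u + daddr) * 31u + (sport * 31u + dport);&lt;br /&gt;
     return h % ncores;   /* the same 4-tuple always maps to the same core */&lt;br /&gt;
 }&lt;br /&gt;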
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating counters shared between multiple cores. The solution in the paper is to use sloppy counters, which let each core track its own separate count of references and use a central shared counter to keep all the counts on track. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making their implementation much easier to accomplish. The main disadvantages of sloppy counters are that they perform poorly in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and that the counters use up space proportional to the number of cores.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
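&lt;br /&gt;
A rough userspace sketch of the general idea (our own simplification, not the kernel implementation): each core updates only its own slot, and the true total is only assembled when it is actually needed. In a real implementation each slot would also be padded to its own cache line.&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 48&lt;br /&gt;
 &lt;br /&gt;
 struct sloppy_counter {&lt;br /&gt;
     _Atomic long central;           /* rarely-touched shared value */&lt;br /&gt;
     _Atomic long local[NCORES];     /* one slot per core           */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Updates touch only the per-core slot, so cores do not contend. */&lt;br /&gt;
 static void counter_add(struct sloppy_counter *c, int core, long n)&lt;br /&gt;
 {&lt;br /&gt;
     atomic_fetch_add(&amp;amp;c-&amp;gt;local[core], n);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Reading the true value is the expensive, but rare, operation. */&lt;br /&gt;
 static long counter_read(struct sloppy_counter *c)&lt;br /&gt;
 {&lt;br /&gt;
     long sum = atomic_load(&amp;amp;c-&amp;gt;central);&lt;br /&gt;
     for (int i = 0; i &amp;lt; NCORES; i++)&lt;br /&gt;
         sum += atomic_load(&amp;amp;c-&amp;gt;local[i]);&lt;br /&gt;
     return sum;&lt;br /&gt;
 }&lt;br /&gt;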
&lt;br /&gt;
====Lock-free comparison &amp;amp; Avoiding unnecessary locking: &#039;&#039;Section 4.4 &amp;amp; 4.7&#039;&#039;====&lt;br /&gt;
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means reduced performance when returning information about a specific file path while multiple threads are trying to access files in common parent directories, because the kernel serializes the process. The patched kernel solves this problem by introducing a new counter that keeps track of threads actively looking at the directory entry cache. If a thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other, the situation is rightfully ignored, allowing much faster access to different entries in the directory entry cache.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are many other locks/mutexes that have special cases where they don&#039;t need to lock. Others can be split so that they lock only part of a data structure rather than the whole thing. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
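&lt;br /&gt;
The general pattern can be sketched as an optimistic, lock-free check that falls back to locking on any sign of interference (our own sketch, not the kernel code; a writer would bump seq to an odd value before modifying the entry and back to an even value afterwards, and a real version also needs memory barriers around the name read):&lt;br /&gt;
 #include &amp;lt;stdatomic.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;string.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 struct entry {&lt;br /&gt;
     _Atomic unsigned seq;   /* odd while a writer is modifying the entry */&lt;br /&gt;
     char name[64];&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Returns 1 on match, 0 on mismatch, -1 if the caller must fall back&lt;br /&gt;
    to the ordinary lock because a concurrent modification was detected. */&lt;br /&gt;
 static int compare_lockfree(struct entry *e, const char *want)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned before = atomic_load(&amp;amp;e-&amp;gt;seq);&lt;br /&gt;
     if (before &amp;amp; 1u)&lt;br /&gt;
         return -1;                          /* writer active: take the lock */&lt;br /&gt;
     int match = (strcmp(e-&amp;gt;name, want) == 0);&lt;br /&gt;
     unsigned after = atomic_load(&amp;amp;e-&amp;gt;seq);&lt;br /&gt;
     return (before == after) ? match : -1;  /* raced with a writer */&lt;br /&gt;
 }&lt;br /&gt;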
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks due to lock contention: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses were filled in from the central table to the per-core table.&lt;br /&gt;
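&lt;br /&gt;
A toy sketch of the per-core idea (our own illustration, not the kernel code): each core allocates from its own free list and only falls back to the shared central list on a miss, so the shared state, and the lock that would protect it in a real implementation, is rarely touched.&lt;br /&gt;
 #include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 48&lt;br /&gt;
 &lt;br /&gt;
 struct buf { struct buf *next; };&lt;br /&gt;
 &lt;br /&gt;
 struct freelist {&lt;br /&gt;
     struct buf *central;           /* shared; lock-protected in reality */&lt;br /&gt;
     struct buf *percore[NCORES];   /* one list head per core            */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static struct buf *alloc_buf(struct freelist *fl, int core)&lt;br /&gt;
 {&lt;br /&gt;
     struct buf *b = fl-&amp;gt;percore[core];&lt;br /&gt;
     if (b != NULL) {                       /* common, contention-free case */&lt;br /&gt;
         fl-&amp;gt;percore[core] = b-&amp;gt;next;&lt;br /&gt;
         return b;&lt;br /&gt;
     }&lt;br /&gt;
     b = fl-&amp;gt;central;                       /* per-core miss: go central */&lt;br /&gt;
     if (b != NULL)&lt;br /&gt;
         fl-&amp;gt;central = b-&amp;gt;next;&lt;br /&gt;
     return b;&lt;br /&gt;
 }&lt;br /&gt;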
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables can cause different cores to contend for the same cache line even though they are using different variables. With one core repeatedly writing a variable while another core reads a different variable on the same cache line, there was a severe bottleneck. By moving the often-written variable to another cache line the bottleneck was removed.&lt;br /&gt;
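&lt;br /&gt;
A minimal illustration of the fix (our own example; the field names are made up): keep a read-mostly field and a frequently-written field on separate cache lines so readers on other cores stop losing the line to the writer.&lt;br /&gt;
 #include &amp;lt;stdalign.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Assuming 64-byte cache lines: alignas keeps the hot counter on its&lt;br /&gt;
    own line, so reads of read_mostly_flags no longer bounce that line&lt;br /&gt;
    between cores every time the counter is written. */&lt;br /&gt;
 struct stats {&lt;br /&gt;
     alignas(64) long read_mostly_flags;   /* read by many cores        */&lt;br /&gt;
     alignas(64) long hot_write_counter;   /* written constantly by one */&lt;br /&gt;
 };&lt;br /&gt;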
&lt;br /&gt;
===Conclusion===&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
With the paper&#039;s modifications, memcached nearly doubles its throughput at 48 cores compared to the stock implementation. After the improvements to the Linux kernel, memcached is limited by hardware. Improvements to hardware scalability, virtual queue handling in this case, will allow further improvements to memcached.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Apache&#039;s throughput scales fairly evenly up to 36 cores, then it slopes downwards. At 48 cores it still has an improved throughput of more than 12 times. Apache, like memcached, is limited by hardware: at higher core counts the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
gmake&#039;s run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code, and some processes finish somewhat later than all the others, which prevents perfect scalability. Even then it achieves the greatest level of scalability of the three programs (a 35x speedup on 48 cores).&lt;br /&gt;
&lt;br /&gt;
====No reason to give up traditional kernel design====&lt;br /&gt;
The contribution of this paper is a body of research focused on techniques and methods for scalability, accomplished through application programming alongside kernel programming. The research contributes by evaluating scalability issues in both application programming and kernel programming. Key findings show the effectiveness of the kernel in handling scaling across CPU cores. In looking at the issue of scalability it is important to note the factors which hinder it.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple scaling techniques can be effective in increasing scalability. The authors looked at three different approaches to removing the bottlenecks within the system: the first was to look for issues within the Linux kernel, the second was to identify issues with the application design, and the third was to address how the application interacts with Linux kernel services. Through this approach, the authors were able to quickly identify problems such as bottlenecks and apply simple techniques to fix the issues at hand. Some of the sections listed above provide insight into the improvements that can be gained from these optimizations.&lt;br /&gt;
&lt;br /&gt;
Through this research on the various techniques listed above, the authors determined that the Linux kernel itself already incorporates many techniques used to improve scalability. The authors go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This suggests that the work of the Linux community has improved Linux a great deal and keeps it current with modern optimization techniques.&lt;br /&gt;
&lt;br /&gt;
It could also be interpreted from the paper that it may be to the benefit of the community to change how applications are programmed, rather than to change the Linux kernel, in order to make scalability improvements. This may indicate that what has come before was done quite well, considering that the Linux kernel optimizations showed more improvement when put in conjunction with the application improvements.&lt;br /&gt;
&lt;br /&gt;
Thus, Linux is only as scalable as its applications. If the kernel design is a limiting factor, there is no indication of it. At the end of the experiment all remaining limits on scalability were either application-side or in IO.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
Aside from a few acronyms that were left unexplained (e.g. Linux TLB), the paper has no real stylistic problems.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ at memcached.org&#039;s wiki suggests using multiple instances per server as a workaround for another problem, which implies that there is no great problem with running multiple servers on one machine [3].&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at a queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core systems, because multiple threads spread across cores are forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on a different port, which is not a practical real-world configuration but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also set up to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed modifications to the kernel won&#039;t necessarily produce the same increase in performance as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernels. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns when it comes to the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
There is no faulty logic in the conclusion, no outright lies, no fallacies. Thus it&#039;s fair. I do not know, however, if it&#039;s valid. -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
&amp;quot;It could also be interpreted from the paper that it may be to the benefit of the community to change how the applications are programmed rather than make changes to the Linux kernel in order to make scalability improvements.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
I think that needs to be rephrased. In certain cases changes needed to be made at the kernel level, alongside the application code, to improve performance. I think the point of the article was that no changes needed to be made to the kernel design, but the kernel itself obviously needs work. -kirill&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
[7] MOSBENCH: http://pdos.csail.mit.edu/mosbench/&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and forks twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48-core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6686</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6686"/>
		<updated>2010-12-03T03:16:29Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* To Do */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*contribution (conclusion mostly): just need to re-work some sections and follow the cues I left in the Conclusion section. &lt;br /&gt;
*critique (conclusion mostly): critique the conclusion of the essay&lt;br /&gt;
*style: the style section is largely untouched. Daniel and I ([[Rannath]]) have put some thoughts there, but that section needs to be made into sentences.&lt;br /&gt;
*yea i didn&#039;t bother rereading for changes my b, just trying to make sure our information is valid ya kno? on that note what are we looking at to get thing this on the final page?&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the others process them. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves much reading and writing, the test cases use the in-memory filesystem tmpfs to sidestep bottlenecks caused by the filesystem or storage hardware. gmake is also limited in scalability, to a small degree, by the serial processes that run at the beginning and end of its execution. gmake spends much of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of cores a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way for the standard Linux kernel to scale to a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem with a standard Linux OS is that it is not designed for massive scalability, which will soon prove to be a problem. The issue is that a core working alone performs much more work than a single core working alongside 47 other cores. Although this may seem acceptable because there are 48 cores dividing the work, ideally each core should keep doing as much work as possible so that the workload is processed as fast as possible.&lt;br /&gt;
&lt;br /&gt;
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user-level design, and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and by taking advantage of recent improvements to its scalability features. At the user level, applications can be improved to focus more on parallelism, since some programs do not yet take advantage of those features. The final aspect of improving scalability is how an application uses kernel services, so that different parts of the program are not conflicting over the same services. All of the bottlenecks are easily found and only require simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research builds on a foundation of previous work on scalability in UNIX systems. The major developments, from shared memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing, ended up creating a base set of techniques which can be used to improve scalability. These techniques have been incorporated in all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update, an algorithm used to avoid locks and atomic instructions which hurt scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is an excellent base of existing Linux scalability studies on which this paper can model its testing standards, including research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; In addition, that base of studies can be used to improve the results of these experiments by learning from previous results, and it may also aid in identifying bottlenecks, which speeds up creating solutions for those problems.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
===16 scalability improvements===&lt;br /&gt;
There were 16 scalability problems encountered by MOSBENCH applications. Each was fixed. The fixes add 2617 lines of code and remove 385 lines of code from Linux.&lt;br /&gt;
&lt;br /&gt;
===MOSBENCH===&lt;br /&gt;
MOSBENCH is a set of applications available through MIT. They are designed to measure scalability: &amp;quot;It consists of applications that previous work has shown not to scale well on Linux and applications that are designed for parallel execution and are kernel intensive.&amp;quot;[7]&lt;br /&gt;
&lt;br /&gt;
===Techniques to improve scalability===&lt;br /&gt;
&lt;br /&gt;
====What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;====&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that a parallel program can only be sped up by the inverse of the proportion of the program that cannot be made parallel (e.g. 25%(.25) non-parallel --&amp;gt; limit of 4x speedup) (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
Linux&#039;s packet processing path requires packets to travel along several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and it can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for them is accomplished quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating counters shared between multiple cores. The solution in the paper is to use sloppy counters, which let each core track its own separate count of references and use a central shared counter to keep all the counts on track. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making their implementation much easier to accomplish. The main disadvantages of sloppy counters are that they perform poorly in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and that the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison &amp;amp; Avoiding unnecessary locking: &#039;&#039;Section 4.4 &amp;amp; 4.7&#039;&#039;====&lt;br /&gt;
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means reduced performance when returning information about a specific file path while multiple threads are trying to access files in common parent directories, because the kernel serializes the process. The patched kernel solves this problem by introducing a new counter that keeps track of threads actively looking at the directory entry cache. If a thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other, the situation is rightfully ignored, allowing much faster access to different entries in the directory entry cache.&lt;br /&gt;
&lt;br /&gt;
There are many other locks/mutexes that have special cases where they don&#039;t need to lock. Others can be split so that they lock only part of a data structure rather than the whole thing. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks due to lock contention: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses were filled in from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables can cause different cores to contend for the same cache line even though they are using different variables. With one core repeatedly writing a variable while another core reads a different variable on the same cache line, there was a severe bottleneck. By moving the often-written variable to another cache line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
===Conclusion===&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
With the paper&#039;s modifications, memcached nearly doubles its throughput at 48 cores compared to the stock implementation. After the improvements to the Linux kernel, memcached is limited by hardware. Improvements to hardware scalability, virtual queue handling in this case, will allow further improvements to memcached.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Apache&#039;s throughput scales fairly evenly up to 36 cores, then it slopes downwards. At 48 cores it still has an improved throughput of more than 12 times. Apache, like memcached, is limited by hardware: at higher core counts the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
gmake&#039;s run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code, and some processes finish somewhat later than all the others, which prevents perfect scalability. Even then it achieves the greatest level of scalability of the three programs (a 35x speedup on 48 cores).&lt;br /&gt;
&lt;br /&gt;
====No reason to give up traditional kernel design====&lt;br /&gt;
&lt;br /&gt;
The contribution of this paper is a body of research focused on techniques and methods for scalability, accomplished through application programming alongside kernel programming. The research contributes by evaluating scalability issues in both application programming and kernel programming. Key findings show the effectiveness of the kernel in handling scaling across CPU cores. In looking at the issue of scalability it is important to note the factors which hinder it.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple scaling techniques can be effective in increasing scalability. The authors looked at three different approaches to removing the bottlenecks within the system: the first was to look for issues within the Linux kernel, the second was to identify issues with the application design, and the third was to address how the application interacts with Linux kernel services. Through this approach, the authors were able to quickly identify problems such as bottlenecks and apply simple techniques to fix the issues at hand. Some of the sections listed above provide insight into the improvements that can be gained from these optimizations.&lt;br /&gt;
&lt;br /&gt;
Through this research on the various techniques listed above, the authors determined that the Linux kernel itself already incorporates many techniques used to improve scalability. The authors go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This suggests that the work of the Linux community has improved Linux a great deal and keeps it current with modern optimization techniques. It could also be interpreted from the paper that it may be to the benefit of the community to change how applications are programmed, rather than to change the Linux kernel, in order to make scalability improvements. This may indicate that what has come before was done quite well, considering that the Linux kernel optimizations showed more improvement when put in conjunction with the application improvements.&lt;br /&gt;
&lt;br /&gt;
 The hypothesis of the paper was the viability of traditional kernel design. This should explicitly tie into that.&lt;br /&gt;
 Also the fact that more then half of the programs were limited by IO should be mentioned.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
Aside from a few acronyms that were left unexplained (e.g. Linux TLB), the paper has no real stylistic problems.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ at memcached.org&#039;s wiki suggests using multiple instances per server as a workaround for another problem, which implies that there is no great problem with running multiple servers on one machine [3].&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at a queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core systems, because multiple threads spread across cores are forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on a different port, which is not a practical real-world configuration but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also set up to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed modifications to the kernel won&#039;t necessarily produce the same increase in performance as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernels. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns when it comes to the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
Given that all tests are more or less fair for the purposes of the benchmarks, they would support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
[7] MOSBENCH: http://pdos.csail.mit.edu/mosbench/&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and forks twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48-core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6685</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6685"/>
		<updated>2010-12-03T03:14:29Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*contribution (conclusion mostly): just need to re-work some sections and follow the cues I left in the Conclusion section. &lt;br /&gt;
*critique (conclusion mostly): critique the conclusion of the essay&lt;br /&gt;
*style: the style section is largely untouched. Daniel and I ([[Rannath]]) have put some thoughts there, but that section needs to be made into sentences.&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. A single instance of memcached running across many cores is bottlenecked by an internal lock, which the MIT team avoids by running one instance per core. Each client connects to a single instance of memcached, allowing the server to simulate parallelism without major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads, making it a good example of parallel programming: one thread accepts incoming connections and the others process them. On a single core, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel, and since it creates more processes than there are cores, it can make good use of multiple cores.[2] Because gmake does a great deal of reading and writing, the test cases use the in-memory filesystem tmpfs to avoid bottlenecks caused by the filesystem or storage hardware. gmake is also limited in scalability, to a small degree, by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, processing recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. The question is whether a standard Linux kernel, running standard user-level applications, can scale on a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem is that a standard Linux OS is not designed for massive scalability, which will soon prove to be a problem. The symptom is that a core running alone performs much more work than a single core working alongside 47 others. Although it is natural for per-core work to drop when 48 cores divide the load, ideally each core should still be kept as busy as possible so that total throughput grows with the core count.&lt;br /&gt;
&lt;br /&gt;
To fix these scalability issues, it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and by taking advantage of recent improvements to its scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs do not yet exploit those features. The final aspect is how an application uses kernel services, so that resources are shared in a way that keeps different parts of the program from contending for the same services. The authors found that all of the bottlenecks were easy to locate and required only relatively simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research builds on a foundation of earlier work on scalability in UNIX systems. Major developments, from shared-memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing, produced a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid locks and atomic instructions that hurt scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is also an excellent base of existing Linux scalability studies on which this paper can model its testing methodology, including research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; That body of work can improve these experiments by letting the authors learn from previous results, and it helps in identifying bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
===16 scalability improvements===&lt;br /&gt;
The MOSBENCH applications exposed 16 scalability problems, each of which was fixed. The fixes add 2617 lines of code to Linux and remove 385.&lt;br /&gt;
&lt;br /&gt;
===MOSBENCH===&lt;br /&gt;
MOSBENCH is a set of applications available through MIT, designed to measure scalability: &amp;quot;It consists of applications that previous work has shown not to scale well on Linux and applications that are designed for parallel execution and are kernel intensive.&amp;quot;[7]&lt;br /&gt;
&lt;br /&gt;
===Techniques to improve scalability===&lt;br /&gt;
&lt;br /&gt;
====What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;====&lt;br /&gt;
*The fraction of a program that must run serially largely determines how much the application can be sped up by adding cores. This is Amdahl&#039;s Law.&lt;br /&gt;
** Amdahl&#039;s Law states that if a fraction s of a program cannot be made parallel, the overall speedup is limited to at most 1/s no matter how many cores are used (e.g. 25% (0.25) serial code limits the speedup to 4x). A small illustrative snippet follows this list.&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, lock wait time increases&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, the execution time of the cache coherence protocol increases&lt;br /&gt;
**Competing for space in a shared hardware cache: as the number of cores increases, the cache miss rate increases&lt;br /&gt;
**Competing for shared hardware resources: as the number of cores increases, more time is lost waiting for those resources&lt;br /&gt;
**Too few tasks for the available cores leads to idle cores&lt;br /&gt;
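&lt;br /&gt;
A minimal illustrative snippet of the Amdahl&#039;s Law bound (our own example, not code from the paper; the function name and sample numbers are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Amdahl&#039;s Law: with serial fraction s and n cores, speedup is at most&lt;br /&gt;
  * 1 / (s + (1 - s) / n), which approaches 1/s as n grows. */&lt;br /&gt;
 static double amdahl_speedup(double serial_fraction, int cores)&lt;br /&gt;
 {&lt;br /&gt;
     return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     /* e.g. 25% serial code caps a 48-core machine below 4x */&lt;br /&gt;
     printf(&quot;48 cores, 25%% serial: %.2fx\n&quot;, amdahl_speedup(0.25, 48));&lt;br /&gt;
     printf(&quot;limit as cores grow:   %.2fx\n&quot;, 1.0 / 0.25);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;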
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
Linux&#039;s packet processing path requires a packet to travel through several queues before it finally becomes available to the application. This technique works well for most general socket applications. In recent kernel releases, Linux takes advantage of multiple hardware queues (when the network interface provides them) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even direct a flow to the core on which the receiving application is running using Receive Flow Steering[2] for better performance. Linux also tries to improve performance with a sampling technique in which it examines every 20th outgoing packet and directs the flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is a good chance packets will be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in this article uses multiple hardware queues (configured with Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only on the core whose thread is dedicated to processing it. If the current core&#039;s queue is empty, it will attempt to obtain work from queues on other cores. This configuration is ideal for numerous short connections, since all the work for a connection is done quickly on one core, avoiding unnecessary mutex delays on packet queues and inter-core cache misses.&lt;br /&gt;
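&lt;br /&gt;
To make the per-core connection handling concrete, here is a rough hypothetical sketch (not the actual Apache or kernel code; all names and the toy queue are our own) of a worker that drains its own core&#039;s queue first and only takes work from other cores when its own queue is empty:&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 4&lt;br /&gt;
 #define QLEN   8&lt;br /&gt;
 &lt;br /&gt;
 /* Hypothetical per-core connection queues (illustration only). */&lt;br /&gt;
 struct conn_queue {&lt;br /&gt;
     int conns[QLEN];&lt;br /&gt;
     int head, tail;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static struct conn_queue queues[NCORES];&lt;br /&gt;
 &lt;br /&gt;
 static int dequeue(struct conn_queue *q)&lt;br /&gt;
 {&lt;br /&gt;
     if (q-&gt;head == q-&gt;tail)&lt;br /&gt;
         return -1;                      /* queue is empty */&lt;br /&gt;
     return q-&gt;conns[q-&gt;head++ % QLEN];&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* A worker pinned to one core prefers its own queue and only steals&lt;br /&gt;
  * connections from other cores when it would otherwise be idle. */&lt;br /&gt;
 static int next_connection(int core)&lt;br /&gt;
 {&lt;br /&gt;
     int c = dequeue(&amp;queues[core]);&lt;br /&gt;
     for (int other = 0; c &lt; 0 &amp;&amp; other &lt; NCORES; other++)&lt;br /&gt;
         if (other != core)&lt;br /&gt;
             c = dequeue(&amp;queues[other]);&lt;br /&gt;
     return c;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     queues[0].conns[queues[0].tail++ % QLEN] = 42;  /* pending conn on core 0 */&lt;br /&gt;
     printf(&quot;core 1 handles conn %d (stolen)\n&quot;, next_connection(1));&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;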
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications under test referenced and updated shared counters from multiple cores. The paper&#039;s solution is sloppy counters: each core tracks its own separate count of references, and a central shared counter keeps the overall count consistent. This is ideal because each core updates its count by modifying its per-core counter, usually touching only its own local cache and so avoiding waits on locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, which makes them much easier to adopt. The main disadvantages are that de-allocating an object becomes expensive in workloads where de-allocation is frequent, and that the counters use space proportional to the number of cores.&lt;br /&gt;
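&lt;br /&gt;
A rough sketch of the sloppy-counter idea, assuming a toy single-threaded model with no real locking (the names and the BATCH size are our own, not the kernel&#039;s):&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 4&lt;br /&gt;
 #define BATCH  16   /* references a core borrows from the shared counter at once */&lt;br /&gt;
 &lt;br /&gt;
 /* Toy sloppy counter: one shared count plus per-core caches of spare&lt;br /&gt;
  * references; a real implementation protects the shared count with a lock. */&lt;br /&gt;
 struct sloppy_counter {&lt;br /&gt;
     long shared;&lt;br /&gt;
     long local[NCORES];&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static void sloppy_get(struct sloppy_counter *c, int core)&lt;br /&gt;
 {&lt;br /&gt;
     if (c-&gt;local[core] == 0) {      /* slow path: refill from the shared count */&lt;br /&gt;
         c-&gt;shared += BATCH;&lt;br /&gt;
         c-&gt;local[core] = BATCH;&lt;br /&gt;
     }&lt;br /&gt;
     c-&gt;local[core]--;               /* fast path: touches core-local data only */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 static void sloppy_put(struct sloppy_counter *c, int core)&lt;br /&gt;
 {&lt;br /&gt;
     c-&gt;local[core]++;               /* return the reference locally */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     struct sloppy_counter refs = {0};&lt;br /&gt;
     sloppy_get(&amp;refs, 0);&lt;br /&gt;
     sloppy_get(&amp;refs, 1);&lt;br /&gt;
     sloppy_put(&amp;refs, 0);&lt;br /&gt;
     printf(&quot;shared=%ld local0=%ld local1=%ld\n&quot;, refs.shared, refs.local[0], refs.local[1]);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;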
&lt;br /&gt;
====Lock-free comparison &amp;amp; Avoiding unnecessary locking: &#039;&#039;Section 4.4 &amp;amp; 4.7&#039;&#039;====&lt;br /&gt;
The traditional Linux kernel has very poor scalability for name lookups in the directory entry cache. This means reduced performance when returning information about a file path while multiple threads access files under common parent directories, because the kernel serializes the lookups. The patched kernel addresses this by introducing a new counter that tracks threads actively examining the directory entry cache. If a thread is about to disturb an entry currently in use by another, the default locking protocol is used to avoid race conditions; if the activities have no bearing on each other, no lock is taken, allowing much faster access to different entries in the directory entry cache.&lt;br /&gt;
&lt;br /&gt;
There are many other locks and mutexes with special cases where locking is unnecessary, and others that can be changed from locking a whole data structure to locking only part of it. Both kinds of change remove or reduce bottlenecks.&lt;br /&gt;
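&lt;br /&gt;
A hypothetical sketch of the generation-check idea behind the lock-free comparison (our own simplification, not the patched kernel&#039;s code): compare the name without a lock, and fall back to the locked path only if the entry changed underneath us:&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 #include &lt;string.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Toy directory-cache entry with a generation counter that is bumped&lt;br /&gt;
  * whenever the entry is modified (illustration only). */&lt;br /&gt;
 struct dentry {&lt;br /&gt;
     char name[32];&lt;br /&gt;
     unsigned long gen;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Lock-free name check: read the generation, compare, then re-check the&lt;br /&gt;
  * generation.  If it changed, a concurrent writer raced with us and the&lt;br /&gt;
  * caller must retry under the usual lock. */&lt;br /&gt;
 static int name_matches_lockfree(struct dentry *d, const char *name, int *stable)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned long before = d-&gt;gen;&lt;br /&gt;
     int match = strcmp(d-&gt;name, name) == 0;&lt;br /&gt;
     *stable = (d-&gt;gen == before);   /* 0 means: take the locked slow path */&lt;br /&gt;
     return match;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     struct dentry d = { &quot;etc&quot;, 7 };&lt;br /&gt;
     int stable;&lt;br /&gt;
     int match = name_matches_lockfree(&amp;d, &quot;etc&quot;, &amp;stable);&lt;br /&gt;
     printf(&quot;match=%d stable=%d\n&quot;, match, stable);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;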
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks due to lock contention: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each was decentralized into per-core versions of itself. In the case of vfsmount the central data structure is maintained, and on a per-core miss the entry is copied from the central table into the per-core table.&lt;br /&gt;
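&lt;br /&gt;
A minimal sketch of decentralizing a free list into per-core lists with a shared fallback, illustrating the idea of section 4.5 (the structure and function names are hypothetical, not the kernel&#039;s):&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 #include &lt;stdlib.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 4&lt;br /&gt;
 &lt;br /&gt;
 /* Toy buffer free lists: one shared list (lock-protected in reality)&lt;br /&gt;
  * plus a list per core that only that core normally touches. */&lt;br /&gt;
 struct buf { struct buf *next; };&lt;br /&gt;
 &lt;br /&gt;
 static struct buf *global_free;&lt;br /&gt;
 static struct buf *percore_free[NCORES];&lt;br /&gt;
 &lt;br /&gt;
 static struct buf *alloc_buf(int core)&lt;br /&gt;
 {&lt;br /&gt;
     struct buf *b = percore_free[core];&lt;br /&gt;
     if (b) {                            /* fast path: no cross-core traffic */&lt;br /&gt;
         percore_free[core] = b-&gt;next;&lt;br /&gt;
         return b;&lt;br /&gt;
     }&lt;br /&gt;
     b = global_free;                    /* slow path: fall back to shared list */&lt;br /&gt;
     if (b)&lt;br /&gt;
         global_free = b-&gt;next;&lt;br /&gt;
     return b;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 static void free_buf(int core, struct buf *b)&lt;br /&gt;
 {&lt;br /&gt;
     b-&gt;next = percore_free[core];       /* return the buffer to the local list */&lt;br /&gt;
     percore_free[core] = b;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     struct buf *b = malloc(sizeof(*b));&lt;br /&gt;
     free_buf(0, b);&lt;br /&gt;
     printf(&quot;reused the same buffer: %d\n&quot;, alloc_buf(0) == b);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;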
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables can cause different cores to contend for the same cache line even though they access different variables. When one core repeatedly wrote a variable while other cores read a different variable on the same line, there was a severe bottleneck. Moving the frequently written variable to another cache line removed the bottleneck.&lt;br /&gt;
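&lt;br /&gt;
A small sketch of this kind of fix, using a generic example with the common GCC alignment attribute rather than the paper&#039;s actual patch: the frequently written field gets its own cache line so readers of the other field no longer share it.&lt;br /&gt;
&lt;br /&gt;
 #include &lt;stdio.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define CACHE_LINE 64&lt;br /&gt;
 &lt;br /&gt;
 /* False sharing: both fields share one cache line, so a core writing&lt;br /&gt;
  * &#039;writes&#039; invalidates the line for every core reading &#039;reads&#039;. */&lt;br /&gt;
 struct stats_bad {&lt;br /&gt;
     long reads;&lt;br /&gt;
     long writes;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Fix: place the frequently written field on its own cache line. */&lt;br /&gt;
 struct stats_good {&lt;br /&gt;
     long reads;&lt;br /&gt;
     long writes __attribute__((aligned(CACHE_LINE)));&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     printf(&quot;bad: %zu bytes, good: %zu bytes\n&quot;,&lt;br /&gt;
            sizeof(struct stats_bad), sizeof(struct stats_good));&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;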
&lt;br /&gt;
===Conclusion===&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached nearly doubles its throughput at 48 cores under the paper&#039;s modified kernel compared with the stock kernel. After the kernel improvements, memcached is limited by hardware; improvements to hardware scalability, virtual queue handling in this case, would allow further gains.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Apache maintains a fairly constant throughput up to 36 cores, after which throughput slopes downward. At 48 cores it still achieves an improvement of more than 12 times. Like memcached, Apache is limited by hardware: at higher core counts the network card simply cannot handle the number of packets, and the FIFO queue it keeps for them overflows.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
gmake&#039;s run time is nearly unchanged by the modifications presented in this paper. This is largely because the program has serial sections of code, and some processes finish somewhat later than the others, which prevents perfect scalability. Even so, it achieves the greatest scalability of the three programs (a 35x speedup on 48 cores).&lt;br /&gt;
&lt;br /&gt;
====No reason to give up traditional kernel design====&lt;br /&gt;
&lt;br /&gt;
The contribution of this paper is a body of research focused on techniques and methods for scalability, approached through application programming alongside kernel programming. The research contributes by evaluating where scalability problems lie in application code versus kernel code. Its key findings show how effectively the kernel can handle scaling across CPU cores. In examining scalability, it is important to note the causes of the factors that hinder it.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple techniques can be effective in increasing scalability. The authors looked at three approaches to removing bottlenecks in the system: first, identifying issues within the Linux kernel itself; second, identifying issues with application design; and third, addressing how the application interacts with Linux kernel services. Through this approach, the authors were able to quickly identify bottlenecks and apply simple fixes that yielded real benefits. The sections above give insight into the improvements that these optimizations can provide.&lt;br /&gt;
&lt;br /&gt;
Through this research, the authors determined that the Linux kernel already incorporates many techniques that improve scalability. They go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This suggests that the work of the Linux community has improved Linux considerably and that it is current with modern optimization techniques. It could also be read as suggesting that the community may benefit more from changing how applications are programmed than from changing the Linux kernel in order to make scalability improvements, especially since the kernel optimizations showed the greatest improvement when combined with the application-level changes.&lt;br /&gt;
&lt;br /&gt;
 The hypothesis of the paper was the viability of traditional kernel design. This should explicitly tie into that.&lt;br /&gt;
 Also the fact that more than half of the programs were limited by IO should be mentioned.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
Aside from a few unexplained acronyms (e.g. Linux TLB), the paper has no real stylistic problems.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise, the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so neither is given an advantage. memcached itself is not scalable, so the MIT team was forced to run one instance per core to keep throughput up. The FAQ at memcached.org&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies there is no great problem with running multiple servers on one machine [3].&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Linux has a built-in kernel flaw whereby network packets are forced to travel through multiple queues before they arrive at the queue from which the application can process them. This imposes significant costs on multi-core systems due to queue locking, and it inherently diminishes Apache&#039;s performance on a multi-core system because threads spread across cores must absorb these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to get better parallel execution on a traditional kernel. The patched kernel&#039;s network stack is likewise specific to the problem at hand, namely processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also set up to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed kernel modifications will not necessarily produce the same gains described in the article. This is quite evident in the test where performance degrades past 36 cores due to the limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updates attempted on gmake produced essentially the same scalability results on both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends heavily on the compiler used with it, so depending on which compiler is chosen, gmake could run worse or even slightly better. In any case, there appear to be no fairness concerns in the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
Given that all of the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
[7] MOSBENCH: http://pdos.csail.mit.edu/mosbench/&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system its kernel execution time jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6670</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6670"/>
		<updated>2010-12-03T02:53:08Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*contribution (conclusion mostly): just need to re-work some sections and follow the cues I left in the Conclusion section. &lt;br /&gt;
*critique (conclusion mostly): critique the conclusion of the essay&lt;br /&gt;
*style: the style section is largely untouched. Daniel and I([[Rannath]]) have puts some thoughts there, but that section needs to be made into sentences.&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem with a standard Linux OS is that they are not designed for massive scalability, which will soon prove to be a problem.  The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic states that the situation makes sense because there are 48 cores dividing the work, the information should be processed as fast as possible with each core doing as much work as possible.&lt;br /&gt;
&lt;br /&gt;
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user level design and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and use the current advantages of recent improvement to scalability features. On the user level design, applications can be improved so that there is more focus on parallelism since some programs have not implemented those improved features. The final aspect of improving scalability is how an application uses kernel services to better share resources so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found easily and actually only take simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research uses a foundation of previous research discovered during the development of scalability in UNIX systems. The major developments from shared memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing ended up creating a base set of techniques, which can be used to improve scalability. These techniques have been incorporated in all major operating system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems, such as Read-Copy-Update, which is an algorithm that is used to avoid locks and atomic instructions which affect scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is an excellent base of research on Linux scalability studies that have already been written, on which this research paper can model its testing standards. These papers include research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; In addition, the base of studies can be used to improve the results of these experiments by learning from the previous results. This research may also aid in identifying bottlenecks which speed up creating solutions for those problems.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that a parallel program can only be sped up by the inverse of the proportion of the program that cannot be made parallel (e.g. 25%(.25) non-parallel --&amp;gt; limit of 4x speedup) (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux packet processing technique requires the packets to travel along several queues before it finally becomes available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing. Or even go as far as directing packet flow to the core on which the application is running using Receive Flow Steering[2] for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short lived connections like those associated with Apache since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores due to mutex (mutual exclusion) delays and cache misses. In such scenarios its better to process all connections, with associated packets and queues, on one core to avoid said issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Sharing) to direct all packets from a given connection to the same core. In turn Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on different cores. This configuration is ideal for numerous short connections as all the work for them in accomplished quickly on one core avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison &amp;amp; Avoiding unnecessary locking: &#039;&#039;Section 4.4 &amp;amp; 4.7&#039;&#039;===&lt;br /&gt;
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means there is reduced performance in returning information pertaining to a specific file path when there are multiple threads trying to access files in common parent directories due to the kernel serializing the process. This problem is solved in the patched kernel by introducing a new counter to keep track of threads actively looking at the directory entry cache. If a certain thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other the situation is rightfully ignored allowing for much faster access to different different entries in the directory entry cache.&lt;br /&gt;
&lt;br /&gt;
There are many other locks/mutexes that have special cases where they don&#039;t need to lock. Others can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks due to lock contention - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line, but different variables. With two different cores requesting the same variable one reading, the other writing there was a severe bottleneck. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached nearly doubles throughput at 48 cores on the paper&#039;s implementation over the stock implementation. After improvements to the Linux kernel memcached is limited by hardware. Improvements to hardware scalability, virtual queue handling in this case, will allow further improvements to memcached.&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Apache keeps fairly a equal amount of throughput up to 36 cores. Then it slopes downwards. At 48 cores it still has an improved throughput of more than 12 times. Apache like memcached is limited by hardware. At higher core count the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
gmakes run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code and some processes that finish somewhat later then all others, which prevents perfect scalability. Even then it achieves the greatest level of scalability out of the three programs (35X speed up for 48 cores.)&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
The contribution of this paper is a lot of research that has focus upon techniques and methods for scalability. This is accomplished through programming of applications alongside kernel programming. This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. In looking at the issue of scalability it is important to note the causes of the factors which hinder scalability.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple scaling techniques can be effective in increasing scalability. The authors looked at three different approaches to removing the bottlenecks within the system. The first was to see if there were issues within the linux kernel application, the second was to identify issues with the application design and the third was to address how the application interacts with the linux kernel services. Through this approach, the authors were able to quickly identify problems such as bottlenecks and apply simple techniques in fixing the issues at hand to reap some beneficial aspects. Some of the sections listed provide insight into the improvements that can be reaped from these optimizations.&lt;br /&gt;
&lt;br /&gt;
Through this research on various techniques as listed above, it was determined by the authors that the Linux kernel itself has many incorporated techniques used to improve scalability. The authors actually go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This tends to show that the work in the Linux community has improved Linux a large amount and is current with modern techniques for optimization. It could also be interpreted from the paper that it may be to the benefit of the community to change how the applications are programmed rather than make changes to the Linux kernel in order to make scalability improvements. This may indicate that what has come before is done quite well when considering that the Linux kernel optimizations showed more improvement when put in conjunction with the application improvements.&lt;br /&gt;
&lt;br /&gt;
 That is just the software that we can change, The programs are also limited by IO hardware.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
Aside from a few acronyms that were unexplained (e.g. Linux TLB) the paper has no real stylistic problems.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. Its an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per-core to keep up throughput. The FAQ at memcached.org&#039;s wiki suggests using multiple implementations per-server as a work around to another problem, which implies that there is no great problem with running multiple servers on one machine [3].&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel implementation of the network stack is also specific to the problem at hand, which is processing multiple short lived connections across multiple cores. Although this provides a performance increase in the given scenario, in more general applications network performance might suffer. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Not the same thing. The article you pointed out talks about spawning different processes to accept connections, which Apache already does by default. What the article says is that they had a separate instance of Apache on each core, each capable of spawning those processes. If you read over section 3.3 again you&#039;ll notice it says Apache spawns a process for each incoming connection.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating that were attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernel. The only change that was found was that gmake spent slightly less time at the system level because of the changes that were made to the system&#039;s caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is uses with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seems to be no fairness concerns when it comes to the scalability testing of gmake as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
Given that all tests are more or less fair for the purposes of the benchmarks. They would support the Hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6665</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6665"/>
		<updated>2010-12-03T02:42:35Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*contribution (conclusion mostly): just need to re-work some sections and follow the cues I left in the Conclusion section. &lt;br /&gt;
*critique (conclusion mostly): critique the conclusion of the essay&lt;br /&gt;
*style: the style section is largely untouched. Daniel and I([[Rannath]]) have puts some thoughts there, but that section needs to be made into sentences.&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem with a standard Linux OS is that they are not designed for massive scalability, which will soon prove to be a problem.  The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic states that the situation makes sense because there are 48 cores dividing the work, the information should be processed as fast as possible with each core doing as much work as possible.&lt;br /&gt;
&lt;br /&gt;
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user level design and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and use the current advantages of recent improvement to scalability features. On the user level design, applications can be improved so that there is more focus on parallelism since some programs have not implemented those improved features. The final aspect of improving scalability is how an application uses kernel services to better share resources so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found easily and actually only take simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research uses a foundation of previous research discovered during the development of scalability in UNIX systems. The major developments from shared memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing ended up creating a base set of techniques, which can be used to improve scalability. These techniques have been incorporated in all major operating system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems, such as Read-Copy-Update, which is an algorithm that is used to avoid locks and atomic instructions which affect scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is an excellent base of research on Linux scalability studies that have already been written, on which this research paper can model its testing standards. These papers include research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; In addition, the base of studies can be used to improve the results of these experiments by learning from the previous results. This research may also aid in identifying bottlenecks which speed up creating solutions for those problems.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The fraction of a program that must run serially largely determines how much the application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that, no matter how many cores are added, a parallel program&#039;s speedup is limited to the inverse of the proportion that cannot be made parallel (e.g. a 25% (0.25) non-parallel portion means a limit of 4x speedup); see the sketch after this list&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, lock wait time increases&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, the cache coherence protocol consumes more execution time&lt;br /&gt;
**Competing for space in a shared hardware cache: as the number of cores increases, the cache miss rate increases&lt;br /&gt;
**Competing for other shared hardware resources: as the number of cores increases, more time is lost waiting for those resources&lt;br /&gt;
**Too few tasks to keep all cores busy leads to idle cores&lt;br /&gt;
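&lt;br /&gt;
The following is a minimal sketch (our own code, not the paper&#039;s) of the bound Amdahl&#039;s Law puts on speedup; the function name amdahl_max_speedup and the parameter serial_fraction are just illustrative:&lt;br /&gt;
&lt;pre&gt;
/* Maximum speedup predicted by Amdahl's Law.
   Example: serial_fraction = 0.25 on 48 cores gives about 3.76x,
   and no core count can push it past 1 / 0.25 = 4x. */
double amdahl_max_speedup(double serial_fraction, int cores)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
&lt;/pre&gt;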
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing path requires packets to travel through several queues before they finally become available to the application. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even direct a flow to the core on which the receiving application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs the flow based on its hash. This poses a problem for short-lived connections, like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in this article uses multiple hardware queues (combined with Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to only accept a connection if the thread dedicated to processing it is on the same core. If the current core&#039;s queue is found to be empty, it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for a connection is accomplished quickly on one core, avoiding unnecessary mutex delays on packet queues and inter-core cache misses. A sketch of this accept strategy follows.&lt;br /&gt;
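&lt;br /&gt;
The sketch below is our own illustration of that accept strategy; none of the names (conn_queue, try_pop, next_connection, NCORES) come from Apache or the paper, and it assumes one worker thread pinned per core:&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;

#define NCORES 48
#define QLEN   128

/* Hypothetical per-core queue of accepted connection file descriptors. */
struct conn_queue {
    pthread_mutex_t lock;
    int fds[QLEN];
    int n;
};

static struct conn_queue queues[NCORES];

void queues_init(void)
{
    for (int i = 0; i &lt; NCORES; i++)
        pthread_mutex_init(&amp;queues[i].lock, NULL);
}

static int try_pop(struct conn_queue *q)
{
    pthread_mutex_lock(&amp;q-&gt;lock);
    int fd = (q-&gt;n &gt; 0) ? q-&gt;fds[--q-&gt;n] : -1;
    pthread_mutex_unlock(&amp;q-&gt;lock);
    return fd;
}

/* Worker pinned to core `me`: handle connections steered to this core first,
   and steal from other cores&#039; queues only when the local queue is empty. */
int next_connection(int me)
{
    for (int i = 0; i &lt; NCORES; i++) {
        int fd = try_pop(&amp;queues[(me + i) % NCORES]);  /* i == 0 is the local queue */
        if (fd != -1)
            return fd;
    }
    return -1;  /* nothing pending anywhere */
}
&lt;/pre&gt;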
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications under test were referencing and updating counters shared between multiple cores. The paper&#039;s solution is sloppy counters, in which each core tracks its own separate count of references and a central shared counter keeps the overall count on track. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, which makes them much easier to adopt. The main disadvantages of sloppy counters are that de-allocating an object becomes expensive when de-allocation happens often (the per-core counts must be reconciled before the object can be freed), and that the counters use space proportional to the number of cores. A sketch of the idea follows.&lt;br /&gt;
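&lt;br /&gt;
Below is a minimal sketch in the spirit of a sloppy counter (our own simplification, not the kernel&#039;s actual implementation); SLOP, sloppy_add and sloppy_read are illustrative names, and it assumes each core only updates its own slot (e.g. the caller is pinned to that core):&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;

#define NCORES 48
#define SLOP   16   /* how far a per-core count may drift before reconciling */

/* One 64-byte slot per core, so neighbouring delta fields never share a cache line. */
struct percore_count {
    long delta;
    char pad[64 - sizeof(long)];
};

static struct percore_count local[NCORES];
static long central;
static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add to this core&#039;s private count; only touch the shared counter (and its
   lock) once the private count has drifted by more than SLOP. */
void sloppy_add(int core, long n)
{
    local[core].delta += n;
    if (local[core].delta &gt; SLOP || local[core].delta &lt; -SLOP) {
        pthread_mutex_lock(&amp;central_lock);
        central += local[core].delta;
        pthread_mutex_unlock(&amp;central_lock);
        local[core].delta = 0;
    }
}

/* Reading the true total means gathering every core&#039;s outstanding delta
   (racy unless updaters are quiesced; acceptable for a sketch). */
long sloppy_read(void)
{
    pthread_mutex_lock(&amp;central_lock);
    long total = central;
    pthread_mutex_unlock(&amp;central_lock);
    for (int i = 0; i &lt; NCORES; i++)
        total += local[i].delta;
    return total;
}
&lt;/pre&gt;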
&lt;br /&gt;
===Lock-free comparison &amp;amp; Avoiding unnecessary locking: &#039;&#039;Section 4.4 &amp;amp; 4.7&#039;&#039;===&lt;br /&gt;
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means reduced performance when returning information about a specific file path while multiple threads are accessing files in common parent directories, because the kernel serializes the process. The patched kernel addresses this by introducing a new counter to keep track of threads actively looking at the directory entry cache. If a thread&#039;s change threatens an entry currently in use by another thread, the default locking protocol is used to avoid race conditions; if the operations have no bearing on each other, no locking is done, allowing much faster access to different entries in the directory entry cache.&lt;br /&gt;
&lt;br /&gt;
There are many other locks/mutexes with special cases where they do not need to lock at all. Others can be split so that they protect only part of a data structure rather than the whole thing. Both of these changes remove or reduce bottlenecks; a sketch of lock splitting follows.&lt;br /&gt;
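&lt;br /&gt;
As a rough illustration of lock splitting (our own example, not code from the paper or the kernel), a hash table can replace one global lock with one lock per bucket, so threads working on different buckets never contend:&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

#define NBUCKETS 1024

struct entry { struct entry *next; int key; };

struct bucket {
    pthread_mutex_t lock;   /* one lock per bucket instead of one global lock */
    struct entry *head;
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
    for (int i = 0; i &lt; NBUCKETS; i++)
        pthread_mutex_init(&amp;table[i].lock, NULL);
}

/* Look up a key while holding only its bucket&#039;s lock. */
struct entry *lookup(unsigned hash, int key)
{
    struct bucket *b = &amp;table[hash % NBUCKETS];
    pthread_mutex_lock(&amp;b-&gt;lock);
    struct entry *e = b-&gt;head;
    while (e != NULL &amp;&amp; e-&gt;key != key)
        e = e-&gt;next;
    pthread_mutex_unlock(&amp;b-&gt;lock);
    return e;
}
&lt;/pre&gt;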
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks due to lock contention: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of the vfsmount table the central data structure was kept, and on a per-core miss the entry is copied from the central table into the per-core table. A sketch of the free-list case follows.&lt;br /&gt;
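&lt;br /&gt;
A minimal sketch of a per-core free list with a shared fallback pool (our own illustration; the kernel&#039;s real packet-buffer code differs), assuming each core only touches its own list, e.g. because threads are pinned to cores:&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;

#define NCORES 48

struct buf { struct buf *next; char data[2048]; };

/* Fast path: each core allocates and frees from its own list, no shared lock. */
static struct buf *freelist[NCORES];

/* Slow path: a shared pool, only touched when a core&#039;s own list is empty. */
static struct buf *shared_pool;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

struct buf *buf_alloc(int core)
{
    struct buf *b = freelist[core];
    if (b != NULL) {                      /* core-local, uncontended */
        freelist[core] = b-&gt;next;
        return b;
    }
    pthread_mutex_lock(&amp;shared_lock);     /* fall back to the shared pool */
    b = shared_pool;
    if (b != NULL)
        shared_pool = b-&gt;next;
    pthread_mutex_unlock(&amp;shared_lock);
    return b != NULL ? b : malloc(sizeof *b);
}

void buf_free(int core, struct buf *b)
{
    b-&gt;next = freelist[core];             /* return to the caller&#039;s own list */
    freelist[core] = b;
}
&lt;/pre&gt;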
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Poorly placed variables can cause different cores to contend for the same cache line even though they are using different variables. When one core repeatedly read one variable while another core wrote a different variable that happened to sit on the same line, the resulting cache-line bouncing was a severe bottleneck. Moving the frequently written variable to another line removed the bottleneck; the sketch below shows the idea.&lt;br /&gt;
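&lt;br /&gt;
A small illustration of false sharing and the padding fix (our own example, not the paper&#039;s actual variables; a 64-byte cache line is assumed):&lt;br /&gt;
&lt;pre&gt;
/* Bad: the two fields share a cache line, so every write to `hits` by one
   core invalidates the line other cores are reading `config` from, even
   though the fields are logically unrelated. */
struct stats_bad {
    long config;   /* read often by all cores        */
    long hits;     /* written constantly by one core */
};

/* Fix: pad so the frequently written field lands 64 bytes away, which puts
   it on a different cache line from the read-mostly field. */
struct stats_good {
    long config;
    char pad[64 - sizeof(long)];   /* fill the rest of the 64-byte line */
    long hits;
};
&lt;/pre&gt;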
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
With the paper&#039;s modifications, memcached&#039;s throughput at 48 cores is nearly double that of the stock implementation. After the improvements to the Linux kernel, memcached is limited by hardware; further gains would require improvements to hardware scalability, in this case in virtual queue handling.&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Apache scales fairly evenly up to 36 cores, then its throughput slopes downward. At 48 cores it still shows an improvement in throughput of more than 12 times. Apache, like memcached, is limited by hardware: at higher core counts the network card simply cannot handle the number of packets, and the FIFO queue it holds for them overflows.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
gmake&#039;s run time is nearly unchanged by the implementation changes presented in this paper. This is largely because the program has serial sections of code, and some processes finish somewhat later than all the others, which prevents perfect scalability. Even so, it achieves the greatest level of scalability of the three programs (a 35x speedup on 48 cores).&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
The contribution of this paper is a body of research focused on techniques and methods for scalability, carried out through application programming alongside kernel programming. The research evaluates where the scalability problems lie between application code and kernel code. Its key findings show how effective the kernel is at scaling work across CPU cores, and it identifies the causes of the factors that hinder scalability.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple techniques can be effective in increasing scalability. The authors took three approaches to removing bottlenecks in the system: looking for issues in the Linux kernel itself, identifying issues in the applications&#039; design, and addressing how the applications interact with Linux kernel services. Through this approach, the authors were able to quickly identify bottlenecks and apply simple techniques to fix them, with clear benefits. The sections listed above give insight into the improvements that these optimizations can deliver.&lt;br /&gt;
&lt;br /&gt;
Through this work the authors determined that the Linux kernel itself already incorporates many techniques for improving scalability. They go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This suggests that the Linux community&#039;s work has already improved Linux a great deal and that it is current with modern optimization techniques. The paper can also be read as suggesting that it may benefit the community more to change how applications are programmed than to change the Linux kernel in order to gain scalability, particularly since the kernel optimizations showed the most improvement when combined with the application improvements.&lt;br /&gt;
&lt;br /&gt;
 That is just the software we can change; the programs are also limited by I/O hardware.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
Aside from a few acronyms that were left unexplained (e.g. TLB), the paper has no real stylistic problems.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise, the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so neither is given an advantage. memcached itself is not scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached.org&#039;s wiki suggests running multiple instances per server as a workaround for another problem, which implies that there is no great problem with running multiple servers on one machine [3].&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue from which the application can process them. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes Apache&#039;s performance on a multi-core system, because multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. These tests were also rigged to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed kernel modifications will not necessarily produce the same increase in performance described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. &#039;&#039;&lt;br /&gt;
*This is untrue: the &#039;&#039;&#039;instance&#039;&#039;&#039; spawned one &#039;&#039;&#039;process&#039;&#039;&#039; per core&lt;br /&gt;
*&amp;quot;Thus, for stock Linux, we run a separate instance of Apache per core with each server running on a distinct port&amp;quot;, the second sentence of Section 5.4 -kirill&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
Since gmake is inherently quite parallel, the testing and tuning attempted on it produced essentially the same scalability results on both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level, because of the changes made to the system&#039;s caching. As stated in the paper, gmake&#039;s execution time depends quite heavily on the compiler it invokes, so depending on which compiler was chosen, gmake could run worse or slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
Given that all the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair if and only if the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system the time spent in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6635</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6635"/>
		<updated>2010-12-03T02:01:39Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Lock-free comparison: Section 4.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*contribution (conclusion mostly): just need to re-work some sections and follow the cues I left in the Conclusion section. &lt;br /&gt;
*critique (conclusion mostly): critique the conclusion of the essay&lt;br /&gt;
*style: the style section is largely untouched. Daniel and I([[Rannath]]) have puts some thoughts there, but that section needs to be made into sentences.&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem - DONE!!!==&lt;br /&gt;
As technology progresses, the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;. The problem with a standard Linux OS is that they are not designed for massive scalability, which will soon prove to be a problem.  The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic states that the situation makes sense because there are 48 cores dividing the work, the information should be processed as fast as possible with each core doing as much work as possible.&lt;br /&gt;
&lt;br /&gt;
To fix those scalability issues, it is necessary to focus on three major areas: the Linux kernel, user level design and how applications use kernel services. The Linux kernel can be improved by optimizing sharing and use the current advantages of recent improvement to scalability features. On the user level design, applications can be improved so that there is more focus on parallelism since some programs have not implemented those improved features. The final aspect of improving scalability is how an application uses kernel services to better share resources so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found easily and actually only take simple changes to correct or avoid.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This research uses a foundation of previous research discovered during the development of scalability in UNIX systems. The major developments from shared memory machines&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; and wait-free synchronization to fast message passing ended up creating a base set of techniques, which can be used to improve scalability. These techniques have been incorporated in all major operating system including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems, such as Read-Copy-Update, which is an algorithm that is used to avoid locks and atomic instructions which affect scalability.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; There is an excellent base of research on Linux scalability studies that have already been written, on which this research paper can model its testing standards. These papers include research on improving scalability on a 32-core machine.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; In addition, the base of studies can be used to improve the results of these experiments by learning from the previous results. This research may also aid in identifying bottlenecks which speed up creating solutions for those problems.&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that a parallel program can only be sped up by the inverse of the proportion of the program that cannot be made parallel (e.g. 25%(.25) non-parallel --&amp;gt; limit of 4x speedup) (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux packet processing technique requires the packets to travel along several queues before it finally becomes available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing. Or even go as far as directing packet flow to the core on which the application is running using Receive Flow Steering[2] for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short lived connections like those associated with Apache since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores due to mutex (mutual exclusion) delays and cache misses. In such scenarios its better to process all connections, with associated packets and queues, on one core to avoid said issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Sharing) to direct all packets from a given connection to the same core. In turn Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on different cores. This configuration is ideal for numerous short connections as all the work for them in accomplished quickly on one core avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
The traditional Linux kernel has very low scalability for name lookups in the directory entry cache. This means there is reduced performance in returning information pertaining to a specific file path when there are multiple threads trying to access files in common parent directories due to the kernel serializing the process. This problem is solved in the patched kernel by introducing a new counter to keep track of threads actively looking at the directory entry cache. If a certain thread threatens an entry currently in use by another, the default locking protocol is used to avoid race conditions. If the activities have no bearing on each other the situation is rightfully ignored allowing for much faster access to different different entries in the directory entry cache.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
When unrelated variables share a cache line, different cores end up reading and writing the same line at the same time often enough to significantly hurt performance (false sharing). Moving the frequently written variable onto its own cache line removed the bottleneck.&lt;br /&gt;
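&lt;br /&gt;
A hedged example of the fix is shown below; the structure and field names are purely illustrative (not the actual variable the paper moved) and a 64-byte cache line is assumed.&lt;br /&gt;
&lt;br /&gt;
 /* Before: a mostly-read field and a hot field share one cache line, so&lt;br /&gt;
  * every write by one core invalidates that line in every other core. */&lt;br /&gt;
 struct stats_bad {&lt;br /&gt;
     long config;                 /* mostly read */&lt;br /&gt;
     long hot_count;              /* written constantly */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* After: force the hot field onto its own cache line (GCC/Clang syntax). */&lt;br /&gt;
 struct stats_good {&lt;br /&gt;
     long config;&lt;br /&gt;
     long hot_count __attribute__((aligned(64)));&lt;br /&gt;
 };&lt;br /&gt;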
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where locking is not actually needed. Likewise, a single mutex protecting a whole data structure can be split into several mutexes, each protecting only part of it. Both changes remove or reduce bottlenecks.&lt;br /&gt;
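&lt;br /&gt;
As a concrete illustration of lock splitting, the sketch below replaces one global lock over a hash table with one lock per bucket; the table and names are invented for the example, and the mutexes would be set up with pthread_mutex_init() at startup.&lt;br /&gt;
&lt;br /&gt;
 /* Lock splitting sketch: per-bucket locks instead of one global lock. */&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NBUCKETS 64&lt;br /&gt;
 &lt;br /&gt;
 struct node { struct node *next; unsigned key; };&lt;br /&gt;
 &lt;br /&gt;
 struct bucket {&lt;br /&gt;
     pthread_mutex_t lock;        /* contention is confined to one bucket */&lt;br /&gt;
     struct node *head;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static struct bucket table[NBUCKETS];&lt;br /&gt;
 &lt;br /&gt;
 static struct node *find(unsigned key)&lt;br /&gt;
 {&lt;br /&gt;
     struct bucket *b = &amp;amp;table[key % NBUCKETS];&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;b-&amp;gt;lock);&lt;br /&gt;
     struct node *n = b-&amp;gt;head;&lt;br /&gt;
     while (n &amp;amp;&amp;amp; n-&amp;gt;key != key)&lt;br /&gt;
         n = n-&amp;gt;next;&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;b-&amp;gt;lock);&lt;br /&gt;
     return n;&lt;br /&gt;
 }&lt;br /&gt;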
&lt;br /&gt;
===memcached: &#039;&#039;Section 5.3&#039;&#039;===&lt;br /&gt;
After the improvements to the Linux kernel, memcached is limited by hardware rather than by the kernel. Improvements in hardware scalability would allow further improvements to memcached.&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 5.4&#039;&#039;===&lt;br /&gt;
Apache, like memcached, is limited by hardware. At higher core counts the network card simply cannot handle the number of packets and its FIFO queue overflows.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 5.6&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
===Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;===&lt;br /&gt;
The contribution of this paper is a body of research focused on techniques and methods for scalability, pursued through changes to the applications alongside changes to the kernel. The research evaluates where the scalability problems lie: in the applications, in the kernel, and in how the two interact. Its key findings show how effectively the kernel can be made to handle scaling across CPU cores, and it identifies the causes of the factors that hinder scalability.&lt;br /&gt;
&lt;br /&gt;
It has been shown that simple techniques can be effective in increasing scalability. The authors looked at three places where bottlenecks can arise: within the Linux kernel itself, within the design of the application, and in how the application uses Linux kernel services. Through this approach the authors were able to quickly identify bottlenecks and apply simple techniques to fix them, with real benefits; the sections above give some insight into the improvements these optimizations can deliver.&lt;br /&gt;
&lt;br /&gt;
Through this work the authors determined that the stock Linux kernel already incorporates many techniques that improve scalability. They go on to speculate that &amp;quot;perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; This suggests that the Linux community&#039;s work has improved Linux a great deal and kept it current with modern optimization techniques. The paper can also be read as suggesting that it may be more beneficial for the community to change how applications are programmed than to change the Linux kernel in order to improve scalability, particularly since the kernel optimizations showed the most improvement when combined with the application-level improvements.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of its content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the storage I/O bottleneck that the paper ignores does not affect it at all. Likewise, the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; kernels are given the same test suite, so neither is given an advantage. memcached itself does not scale across cores, so the MIT team was forced to run one instance per core to keep throughput up. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround for another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw whereby network packets must travel through multiple queues before they reach the queue from which the application can process them. This imposes significant costs on multi-core systems because of the locking around those queues. The flaw inherently diminishes Apache&#039;s performance on multi-core systems, since threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to get better parallel execution out of the traditional kernel. The patched kernel&#039;s network-stack changes are also specific to the problem at hand, namely processing many short-lived connections across many cores; although this gives a performance increase in that scenario, network performance might suffer for more general workloads. The tests were also deliberately set up to avoid the bottlenecks imposed by the network and file-storage hardware, meaning that applying the proposed kernel modifications won&#039;t necessarily produce the same gains described in the article. This is very evident in the test where performance degrades past 36 cores due to the limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem, as the paper specifically states that they are testing what they can improve in spite of hardware limitations.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since gmake is inherently quite parallel, the testing done on it produced essentially the same scalability results for both the stock and the modified kernel. The only difference found was that gmake spent slightly less time at the system level, because of the changes made to the system&#039;s caching. As stated in the paper, gmake&#039;s execution time depends heavily on the compiler it invokes, so depending on which compiler was chosen, gmake could run slightly worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair if and only if the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper; it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly as cores are added. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system the time spent in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6453</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6453"/>
		<updated>2010-12-02T17:48:27Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming). Each process uses one of their threads to accepting incoming connections and others are used to process these connections. On a single core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community which is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With a simple command -j or --jobs, gmake can process many of these recipes in parallel. Since gmake creates more processes than cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves much reading and writing, in order to prevent bottlenecks due to the filesystem or hardware, the test cases use an in-memory filesystem tmpfs, which gives them a backdoor around the bottlenecks for testing purposes. In addition to this, gmake is limited in scalability due to the serial processes that run at the beginning and end of its execution, which limits its scalability to a small degree. gmake spends much of its execution time with its compiler, processing the recipes and recompiling code, but still spend 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technological progress the number of core a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. There has to be a way that standard user level Linux kernel will scale with a 48-core system[1]. The problem with a standard Linux operating is they are not designed for massive scalability which will soon be a problem.  The issue with scalability is that a solo core will perform much more work compared to a single core working with 47 other cores. Although traditional logic that situation makes sense because 48 cores are dividing the work. But when processing information a process the main goal is to finish so as long as possible every core should be doing a much work as possible.&lt;br /&gt;
  &lt;br /&gt;
To fix those scalability issues it is necessary to focus on three major areas: the Linux kernel, user level design and how application use of kernel services. The Linux kernel can be improved be to improve sharing and have the advantage of recent iterations are beginning to implement scalability features. On the user level design applications can be improved so that there is more focus on parallelism since some programs have not implements those improved features. The final aspect of improving scalability is how an application uses kernel services to share resources better so that different aspects of the program are not conflicting over the same services. All of the bottlenecks are found actually only take a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research is based on much research which was created before in the development of scalability for UNIX system.  The major developments from shared memory machines [2], wait-free synchronization to fast message passing have created a base set of techniques which can be used to improve scalability. These techniques have been incorporated in all major operation system including Linux, Mac OS X and Windows.  Linux has been improved with kernel subsystems such as Read-Copy-Update which an algorithm for which is used to avoid locks and atomic instructions which lower scalability.[3] The is also an excellent base a research on Linux scalability studies to base this research paper. These paper include a on doing scalability on a 32-core machine. [4] That research can improve the results by learning from the experiments already performed by researchers. This research also aid identifying bottlenecks which speed up researching solutions for those bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that a parallel program can only be sped up by the inverse of the proportion of the program that cannot be made parallel (e.g. 25%(.25) non-parallel --&amp;gt; limit of 4x speedup) (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux packet processing technique requires the packets to travel along several queues before it finally becomes available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering&amp;lt;ref&amp;gt;J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&amp;lt;/ref&amp;gt; to direct packet flow onto different cores for processing. Or even go as far as directing packet flow to the core on which the application is running using Receive Flow Steering&amp;lt;ref&amp;gt;J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&amp;lt;/ref&amp;gt; for even better performance. Linux also attempts to increase performance using a sampling technique where it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short lived connections like those associated with Apache since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores due to mutex (mutual exclusion) delays and cache misses. In such scenarios its better to process all connections, with associated packets and queues, on one core to avoid said issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Sharing) to direct all packets from a given connection to the same core. In turn Apache is modified to only accept connections if the thread dedicated to processing them is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on different cores. This configuration is ideal for numerous short connections as all the work for them in accomplished quickly on one core avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
{{Reflist}}&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counts by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
One reason the&lt;br /&gt;
required changes are modest is that stock Linux already&lt;br /&gt;
incorporates many modifications to improve scalability.&lt;br /&gt;
More speculatively, perhaps it is the case that Linux’s&lt;br /&gt;
system-call API is well suited to an implementation that&lt;br /&gt;
avoids unnecessary contention over kernel objects.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required a few modest changes to remove bottlenecks. The MIT team speculate that that trend will continue as the number of cores increase. They also state that things not bottlenecked by the CPU are harder to fix. &lt;br /&gt;
&lt;br /&gt;
We can eliminate most kernel bottlenecks that the applications hits most often with minor changes. Most changes were well known methodology, with the exception of Sloppy counters. This study is limited by the removal of the IO bottleneck, but it does suggest that traditional implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so no advantage is given to either. memcached itself is not scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached.org&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where the application can process them. This imposes significant costs on multi-core systems because of queue locking. The flaw inherently diminishes Apache&#039;s performance on multi-core systems, since threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to get better parallel execution on a traditional kernel. The patched kernel&#039;s network stack implementation is also specific to the problem at hand, namely processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also arranged to avoid bottlenecks imposed by the network and file storage hardware, meaning that applying the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput described in the article. This is very evident in the test where performance degrades past 36 cores due to limitations of the networking hardware. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake resulted in essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler that is used with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper, it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6448</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6448"/>
		<updated>2010-12-02T17:44:19Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running across many cores is bottlenecked by an internal lock, which the MIT team avoids by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads, making it a good example of parallel programming. Each process uses one of its threads to accept incoming connections while the others process those connections. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) flag, gmake can process many of these recipes in parallel, and since it creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves a lot of reading and writing, the test cases use the in-memory filesystem tmpfs to prevent bottlenecks caused by the filesystem or storage hardware, giving them a way around those bottlenecks for testing purposes. gmake&#039;s scalability is also limited to a small degree by the serial phases that run at the beginning and end of its execution. gmake spends much of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue, so there has to be a way for a standard user-level Linux kernel to scale on a 48-core system[1]. The problem is that a standard Linux operating system is not designed for massive scalability, and this will soon matter. The symptom of poor scalability is that a core running alone performs much more work than a single core working alongside 47 others. By traditional logic that seems acceptable, because the 48 cores are dividing the work; but the main goal when processing is to finish as quickly as possible, so every core should be doing as much work as possible.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to reduce unnecessary sharing, and it has the advantage that recent releases have already begun to add scalability features. At the user level, applications can be improved to focus more on parallelism, since some programs have not yet implemented those improvements. The final aspect of improving scalability is how an application uses kernel services, sharing resources in a way that keeps different parts of the program from conflicting over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on scalability for UNIX systems. The major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux in particular has gained kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that lower scalability.[3] There is also an excellent base of earlier Linux scalability studies on which to build this paper, including one on scalability on a 32-core machine.[4] The present research can improve its results by learning from the experiments those researchers already performed, and that earlier work also helps identify bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program largely determines how much an application can be sped up. This is Amdahl&#039;s Law.&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup); a worked form is given after this list. (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, so does the time spent waiting for locks&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, so does the time spent in the cache coherence protocol&lt;br /&gt;
**Competing for space in a shared hardware cache: more cores lead to a higher cache miss rate&lt;br /&gt;
**Competing for other shared hardware resources: more cores mean more time lost waiting for those resources&lt;br /&gt;
**Not enough tasks for the cores, which leads to idle cores&lt;br /&gt;
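&lt;br /&gt;
As a worked form of Amdahl&#039;s Law (the standard statement, not something specific to the paper), with p the fraction of the program that can run in parallel and N the number of cores:&lt;br /&gt;
 S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}&lt;br /&gt;
 &lt;br /&gt;
 p = 0.75 \;(25\% \text{ serial}): \quad S(48) = \frac{1}{0.25 + 0.75/48} \approx 3.76, \qquad \text{limit} = 4\times&lt;br /&gt;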
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing technique requires packets to travel along several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique in which it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for a connection is completed quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
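&lt;br /&gt;
As a purely conceptual sketch (not the paper&#039;s kernel patch), connection-to-core steering boils down to hashing a connection&#039;s 4-tuple so that every packet of that connection lands on the same core; the names below are invented, and real NICs use a stronger hash such as Toeplitz:&lt;br /&gt;
 /* steer every packet of a TCP connection to one core by hashing its 4-tuple */&lt;br /&gt;
 struct conn4 {&lt;br /&gt;
     unsigned int   saddr, daddr;     /* source and destination IPv4 addresses */&lt;br /&gt;
     unsigned short sport, dport;     /* source and destination ports */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 unsigned int pick_core(const struct conn4 *c, unsigned int ncores)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned int h = c-&amp;gt;saddr ^ c-&amp;gt;daddr ^ ((unsigned int)c-&amp;gt;sport &amp;lt;&amp;lt; 16) ^ c-&amp;gt;dport;&lt;br /&gt;
     h ^= h &amp;gt;&amp;gt; 16;                    /* cheap mixing step */&lt;br /&gt;
     return h % ncores;               /* the same connection always maps to the same core */&lt;br /&gt;
 }&lt;br /&gt;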
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing referenced and updated shared counters from multiple cores. The solution in the paper is to use sloppy counters, in which each core tracks its own separate count of references while a central shared counter keeps the overall count on track. This is ideal because each core updates its count by modifying its per-core counter, usually needing access only to its own local cache, which cuts down on waiting for locks and on serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. Their main disadvantages are that object de-allocation becomes more expensive, since the per-core counts must be reconciled, so they are a poor fit where de-allocation happens often, and that the counters use space proportional to the number of cores.&lt;br /&gt;
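&lt;br /&gt;
A minimal user-space sketch of the general idea (per-core counts that are only occasionally reconciled with a central count), assuming pthreads; it is not the paper&#039;s exact reference-counting scheme, and NCORES, THRESHOLD and the function names are invented:&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES    48&lt;br /&gt;
 #define THRESHOLD 16              /* how &amp;quot;sloppy&amp;quot; a per-core count may get */&lt;br /&gt;
 &lt;br /&gt;
 static long central;              /* shared count, protected by central_lock */&lt;br /&gt;
 static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;&lt;br /&gt;
 static long local[NCORES];        /* each core updates only its own slot */&lt;br /&gt;
 &lt;br /&gt;
 void counter_add(int core, long n)&lt;br /&gt;
 {&lt;br /&gt;
     local[core] += n;                       /* common case: core-local, no shared lock */&lt;br /&gt;
     if (local[core] &amp;gt;= THRESHOLD || local[core] &amp;lt;= -THRESHOLD) {&lt;br /&gt;
         pthread_mutex_lock(&amp;amp;central_lock); /* rare case: spill into the central count */&lt;br /&gt;
         central += local[core];&lt;br /&gt;
         pthread_mutex_unlock(&amp;amp;central_lock);&lt;br /&gt;
         local[core] = 0;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* slow, approximate read; an exact value would require stopping updaters */&lt;br /&gt;
 long counter_read(void)&lt;br /&gt;
 {&lt;br /&gt;
     long sum = central;&lt;br /&gt;
     for (int i = 0; i &amp;lt; NCORES; i++)&lt;br /&gt;
         sum += local[i];&lt;br /&gt;
     return sum;&lt;br /&gt;
 }&lt;br /&gt;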
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the free list of packet buffers. Each one was decentralized into per-core versions of itself. In the case of the vfsmount table the central structure was kept, and on a per-core miss the entry is copied from the central table into the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
False sharing happens when unrelated variables end up on the same cache line, so that different cores repeatedly read and write that line at the same time, often enough to significantly hurt performance. Moving the frequently written variable onto its own cache line removed the bottleneck.&lt;br /&gt;
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes guard special cases where locking is not actually needed, and a lock that covers a whole data structure can often be split into finer-grained locks that each cover only part of it. Both changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; (quoted from the conclusion of the paper [1])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required only a few modest changes to remove bottlenecks. The MIT team speculates that the trend will continue as the number of cores increases. They also state that workloads that are not bottlenecked by the CPU are harder to fix. &lt;br /&gt;
&lt;br /&gt;
Most of the kernel bottlenecks that the applications hit most often can be eliminated with minor changes. Most of these changes use well-known techniques, with the exception of sloppy counters. The study is limited by its deliberate removal of the I/O bottleneck, but it does suggest that traditional kernel implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so no advantage is given to either. memcached itself is not scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached.org&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where the application can process them. This imposes significant costs on multi-core systems because of queue locking. The flaw inherently diminishes Apache&#039;s performance on multi-core systems, since threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to get better parallel execution on a traditional kernel. The patched kernel&#039;s network stack implementation is also specific to the problem at hand, namely processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also arranged to avoid bottlenecks imposed by the network and file storage hardware, meaning that applying the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput described in the article. This is very evident in the test where performance degrades past 36 cores due to limitations of the networking hardware. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake resulted in essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler that is used with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper, it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6447</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6447"/>
		<updated>2010-12-02T17:43:55Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts - DONE!!!==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running across many cores is bottlenecked by an internal lock, which the MIT team avoids by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads, making it a good example of parallel programming. Each process uses one of its threads to accept incoming connections while the others process those connections. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake takes a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) flag, gmake can process many of these recipes in parallel, and since it creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Since gmake involves a lot of reading and writing, the test cases use the in-memory filesystem tmpfs to prevent bottlenecks caused by the filesystem or storage hardware, giving them a way around those bottlenecks for testing purposes. gmake&#039;s scalability is also limited to a small degree by the serial phases that run at the beginning and end of its execution. gmake spends much of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue, so there has to be a way for a standard user-level Linux kernel to scale on a 48-core system[1]. The problem is that a standard Linux operating system is not designed for massive scalability, and this will soon matter. The symptom of poor scalability is that a core running alone performs much more work than a single core working alongside 47 others. By traditional logic that seems acceptable, because the 48 cores are dividing the work; but the main goal when processing is to finish as quickly as possible, so every core should be doing as much work as possible.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to reduce unnecessary sharing, and it has the advantage that recent releases have already begun to add scalability features. At the user level, applications can be improved to focus more on parallelism, since some programs have not yet implemented those improvements. The final aspect of improving scalability is how an application uses kernel services, sharing resources in a way that keeps different parts of the program from conflicting over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on scalability for UNIX systems. The major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux in particular has gained kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that lower scalability.[3] There is also an excellent base of earlier Linux scalability studies on which to build this paper, including one on scalability on a 32-core machine.[4] The present research can improve its results by learning from the experiments those researchers already performed, and that earlier work also helps identify bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program largely determines how much an application can be sped up. This is Amdahl&#039;s Law.&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup). (I can&#039;t get this to sound right someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, so does the time spent waiting for locks&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, so does the time spent in the cache coherence protocol&lt;br /&gt;
**Competing for space in a shared hardware cache: more cores lead to a higher cache miss rate&lt;br /&gt;
**Competing for other shared hardware resources: more cores mean more time lost waiting for those resources&lt;br /&gt;
**Not enough tasks for the cores, which leads to idle cores&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing technique requires packets to travel along several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique in which it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it&#039;s better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel&#039;s implementation proposed in this article uses multiple hardware queues (through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core. If the current core&#039;s queue is found to be empty it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for a connection is completed quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing referenced and updated shared counters from multiple cores. The solution in the paper is to use sloppy counters, in which each core tracks its own separate count of references while a central shared counter keeps the overall count on track. This is ideal because each core updates its count by modifying its per-core counter, usually needing access only to its own local cache, which cuts down on waiting for locks and on serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. Their main disadvantages are that object de-allocation becomes more expensive, since the per-core counts must be reconciled, so they are a poor fit where de-allocation happens often, and that the counters use space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the free list of packet buffers. Each one was decentralized into per-core versions of itself. In the case of the vfsmount table the central structure was kept, and on a per-core miss the entry is copied from the central table into the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
False sharing happens when unrelated variables end up on the same cache line, so that different cores repeatedly read and write that line at the same time, often enough to significantly hurt performance. Moving the frequently written variable onto its own cache line removed the bottleneck.&lt;br /&gt;
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes guard special cases where locking is not actually needed, and a lock that covers a whole data structure can often be split into finer-grained locks that each cover only part of it. Both changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux&#039;s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; (quoted from the conclusion of the paper [1])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required only a few modest changes to remove bottlenecks. The MIT team speculates that the trend will continue as the number of cores increases. They also state that workloads that are not bottlenecked by the CPU are harder to fix. &lt;br /&gt;
&lt;br /&gt;
Most of the kernel bottlenecks that the applications hit most often can be eliminated with minor changes. Most of these changes use well-known techniques, with the exception of sloppy counters. The study is limited by its deliberate removal of the I/O bottleneck, but it does suggest that traditional kernel implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so no advantage is given to either. memcached itself is not scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached.org&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where the application can process them. This imposes significant costs on multi-core systems because of queue locking. The flaw inherently diminishes Apache&#039;s performance on multi-core systems, since threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to get better parallel execution on a traditional kernel. The patched kernel&#039;s network stack implementation is also specific to the problem at hand, namely processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also arranged to avoid bottlenecks imposed by the network and file storage hardware, meaning that applying the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput described in the article. This is very evident in the test where performance degrades past 36 cores due to limitations of the networking hardware. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake resulted in essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler that is used with gmake, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper; it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice more to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system the time spent in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6439</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6439"/>
		<updated>2010-12-02T17:39:18Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. A single instance of memcached running across many cores is bottlenecked by an internal lock, which the MIT team avoided by running one instance per core. Each client connects to a single instance of memcached, allowing the server to simulate parallelism without needing major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
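&lt;br /&gt;
A minimal sketch of how clients can be spread across the per-core instances (an illustration only, not the paper&#039;s benchmark harness; the port numbers and hash choice are assumptions): each key is hashed to one instance, so every instance is touched by a fixed subset of the traffic.&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define NCORES    48      /* assumption: one memcached instance per core */
#define BASE_PORT 11211   /* default memcached port; instance i listens on BASE_PORT + i */

/* FNV-1a hash; any stable hash over the key would do. */
static uint32_t hash_key(const char *key)
{
    uint32_t h = 2166136261u;
    while (*key) {
        h ^= (unsigned char)*key++;
        h *= 16777619u;
    }
    return h;
}

/* Map a key to the port of the per-core instance that owns it. */
static int port_for_key(const char *key)
{
    return BASE_PORT + (int)(hash_key(key) % NCORES);
}

int main(void)
{
    const char *keys[] = { "user:1", "user:2", "session:42" };
    for (int i = 0; i &lt; 3; i++)
        printf("%s maps to port %d\n", keys[i], port_for_key(keys[i]));
    return 0;
}
&lt;/pre&gt;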
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the others process them. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
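&lt;br /&gt;
As a rough sketch of that per-core layout (not Apache&#039;s real code; the core count and the serve_on_core() stub are made up for illustration), a parent process forks one worker per core and pins each worker to its own core before the worker would start accepting connections.&lt;br /&gt;
&lt;pre&gt;
#define _GNU_SOURCE
#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/wait.h&gt;

#define NCORES 4   /* assumption: one worker process per core */

/* Stub for a worker: real Apache would open its listening socket here and
 * run threads that accept and handle connections. */
static void serve_on_core(int core)
{
    printf("worker %d running on core %d\n", (int)getpid(), core);
}

int main(void)
{
    for (int core = 0; core &lt; NCORES; core++) {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: pin itself to its core */
            cpu_set_t set;
            CPU_ZERO(&amp;set);
            CPU_SET(core, &amp;set);
            sched_setaffinity(0, sizeof(set), &amp;set);
            serve_on_core(core);
            _exit(0);
        }
    }
    while (wait(NULL) &gt; 0)                 /* parent reaps its workers */
        ;
    return 0;
}
&lt;/pre&gt;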
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves a great deal of reading and writing, the test cases use the in-memory filesystem tmpfs to prevent bottlenecks caused by the filesystem or storage hardware, giving them a way around those bottlenecks for testing purposes. In addition, gmake is limited in its scalability, to a small degree, by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. The question is whether a standard Linux kernel and standard user-level applications can be made to scale on a 48-core system[1]. The problem is that a standard Linux system is not designed for massive scalability, and this will soon matter. The scalability issue is that a single core running alone performs much more work than the same core does when working alongside 47 others. At first that seems reasonable, since 48 cores are dividing the work, but the main goal when processing a workload is to finish as quickly as possible, so every core should be doing as much work as possible.&lt;br /&gt;
&lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and recent releases have already begun to add scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs do not yet take advantage of those features. The final aspect is how an application uses kernel services: resources should be shared in a way that keeps different parts of the program from contending over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on the scalability of UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X, and Windows. Linux in particular has gained kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that limit scalability.[3] There is also an excellent base of earlier Linux scalability studies on which to build this paper, including one on scalability on a 32-core machine.[4] The authors can improve their results by learning from the experiments already performed by other researchers, and that prior work also helps identify bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup); see the worked example after this list. (I can&#039;t get this to sound right, someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]])&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
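&lt;br /&gt;
As a quick worked example of Amdahl&#039;s Law (the 25%/4x figures above are example numbers, not results from the paper), the speedup on n cores when a fraction p of the program can run in parallel is 1 / ((1 - p) + p / n):&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Amdahl speedup for parallel fraction p on n cores. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.75;                     /* example: 25% of the program is serial */
    int cores[] = { 1, 8, 48, 1000000 };
    for (int i = 0; i &lt; 4; i++)
        printf("%7d cores: %.2fx speedup\n", cores[i], amdahl(p, cores[i]));
    /* As n grows the speedup approaches 1 / 0.25 = 4x. */
    return 0;
}
&lt;/pre&gt;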
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing scheme requires packets to travel through several queues before they finally become available for the application to use. This works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even direct packet flow to the core on which the application is running, using Receive Flow Steering[2], for better performance. Linux also tries to improve performance with a sampling technique that checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections, like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this scheme performs poorly when there are many open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it is better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in the article uses multiple hardware queues (which can also be driven by Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core; if the current core&#039;s queue is found to be empty, it will attempt to obtain work from the queues of other cores. This configuration is ideal for many short connections, since all the work for a connection is completed quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
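&lt;br /&gt;
The accept policy described above can be sketched as follows (a toy model of the idea, not the paper&#039;s kernel or Apache patch; the queue sizes and integer connection IDs are made up): each core drains its own queue of connections first and only steals from another core&#039;s queue when its own is empty.&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

#define NCORES 4
#define QLEN   8

/* Per-core queue of pending connections (ints stand in for sockets). */
static int queues[NCORES][QLEN];
static int head[NCORES], tail[NCORES];
static pthread_mutex_t qlock[NCORES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER };

/* Pop one connection from a queue; returns -1 if it is empty. */
static int pop(int core)
{
    int conn = -1;
    pthread_mutex_lock(&amp;qlock[core]);
    if (head[core] &lt; tail[core])
        conn = queues[core][head[core]++];
    pthread_mutex_unlock(&amp;qlock[core]);
    return conn;
}

/* Each worker prefers its local queue and steals only when it is idle. */
static void *worker(void *arg)
{
    int core = *(int *)arg;
    for (;;) {
        int conn = pop(core);                              /* local first */
        for (int other = 0; conn &lt; 0 &amp;&amp; other &lt; NCORES; other++)
            if (other != core)
                conn = pop(other);                         /* steal when idle */
        if (conn &lt; 0)
            break;                                         /* no work anywhere */
        printf("core %d handled connection %d\n", core, conn);
    }
    return NULL;
}

int main(void)
{
    int ids[NCORES];
    pthread_t tid[NCORES];
    for (int c = 0; c &lt; NCORES; c++)                       /* preload some work */
        for (int i = 0; i &lt; QLEN; i++)
            queues[c][tail[c]++] = c * 100 + i;
    for (int c = 0; c &lt; NCORES; c++) {
        ids[c] = c;
        pthread_create(&amp;tid[c], NULL, worker, &amp;ids[c]);
    }
    for (int c = 0; c &lt; NCORES; c++)
        pthread_join(tid[c], NULL);
    return 0;
}
&lt;/pre&gt;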
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications under test referenced and updated counters shared across multiple cores. The solution in the paper is to use sloppy counters, which let each core track its own separate count of references while a central shared counter keeps the overall count on track. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. Their main disadvantages are that object de-allocation becomes expensive in workloads where it occurs often, and that the counters use space proportional to the number of cores.&lt;br /&gt;
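&lt;br /&gt;
A minimal single-threaded sketch of the shape of a sloppy counter (the kernel version in the paper handles locking and other details): each core works against its own small stash of references and only touches the shared central counter in batches.&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

#define NCORES 4
#define BATCH  8   /* references moved between local and central at a time */

/* Central counter plus a small per-core stash of spare references.
 * The true count is central minus the sum of the per-core stashes. */
static long central;                /* protected by a lock in a real kernel */
static long local_spare[NCORES];

/* Take one reference on this core; refill from the central counter
 * only when the local stash runs out. */
static void get_ref(int core)
{
    if (local_spare[core] == 0) {
        central += BATCH;           /* one shared update covers BATCH gets */
        local_spare[core] += BATCH;
    }
    local_spare[core]--;
}

/* Drop one reference; spill back to the central counter only in batches. */
static void put_ref(int core)
{
    local_spare[core]++;
    if (local_spare[core] &gt;= 2 * BATCH) {
        central -= BATCH;
        local_spare[core] -= BATCH;
    }
}

int main(void)
{
    for (int i = 0; i &lt; 1000; i++)
        get_ref(i % NCORES);
    for (int i = 0; i &lt; 1000; i++)
        put_ref(i % NCORES);
    long total = central;
    for (int c = 0; c &lt; NCORES; c++)
        total -= local_spare[c];
    printf("net references = %ld\n", total);   /* 0 after matched get/put */
    return 0;
}
&lt;/pre&gt;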
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each was split into per-core versions of itself. In the case of the vfsmount table the central data structure was kept, and on a per-core miss the entry is copied from the central table into the per-core table.&lt;br /&gt;
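&lt;br /&gt;
The free-list change can be sketched like this (a simplification of the idea, not the kernel code; a real version needs a lock on the global list): each core allocates from and frees to its own list, falling back to the shared list only on a local miss.&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define NCORES 4

struct buf { struct buf *next; };

/* One free list per core, plus a central fallback list. */
static struct buf *percpu_free[NCORES];
static struct buf *global_free;

static struct buf *alloc_buf(int core)
{
    struct buf *b = percpu_free[core];
    if (b) {                              /* fast path: local, nothing shared */
        percpu_free[core] = b-&gt;next;
        return b;
    }
    b = global_free;                      /* slow path: central list */
    if (b)
        global_free = b-&gt;next;
    return b ? b : malloc(sizeof(*b));
}

static void free_buf(int core, struct buf *b)
{
    b-&gt;next = percpu_free[core];          /* always return to the local list */
    percpu_free[core] = b;
}

int main(void)
{
    struct buf *a = alloc_buf(0);
    free_buf(0, a);
    struct buf *b = alloc_buf(0);         /* reuses a without touching other cores */
    printf("reused local buffer: %s\n", a == b ? "yes" : "no");
    free(b);
    return 0;
}
&lt;/pre&gt;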
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Variables placed together on the same cache line cause different cores to request that line for reading and writing at the same time, often enough to significantly impact performance. By moving the often-written variable to another cache line, the bottleneck was removed.&lt;br /&gt;
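&lt;br /&gt;
A common way to apply that fix in C is explicit padding so the hot field sits on its own cache line (the 64-byte line size is an assumption about the hardware, and these structs are a generic illustration rather than the actual kernel variables):&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Problematic layout: the often-written counter shares a cache line with
 * read-mostly fields, so every write invalidates the line other cores read. */
struct shared_bad {
    long mostly_read_a;
    long mostly_read_b;
    long hot_counter;             /* written constantly by one core */
};

/* Fixed layout: padding pushes the hot field onto the next cache line
 * (assuming the struct itself starts on a cache line boundary). */
struct shared_good {
    long mostly_read_a;
    long mostly_read_b;
    char pad[CACHE_LINE - 2 * sizeof(long)];
    long hot_counter;
};

int main(void)
{
    printf("bad layout:  %zu bytes\n", sizeof(struct shared_bad));
    printf("good layout: %zu bytes\n", sizeof(struct shared_good));
    return 0;
}
&lt;/pre&gt;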
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex that protects a whole data structure can be split into finer-grained locks that each protect only part of it. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
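&lt;br /&gt;
Lock splitting can be sketched as replacing one mutex over a whole table with one mutex per bucket, so cores contend only when they touch the same bucket (a generic illustration, not the specific kernel locks changed in the paper):&lt;br /&gt;
&lt;pre&gt;
#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;

#define NBUCKETS 16

/* Before: one lock serializes every table operation.
 * After: each bucket has its own lock, so unrelated keys do not contend. */
static int             table[NBUCKETS];
static pthread_mutex_t bucket_lock[NBUCKETS];

static void table_init(void)
{
    for (int i = 0; i &lt; NBUCKETS; i++)
        pthread_mutex_init(&amp;bucket_lock[i], NULL);
}

static void table_add(int key, int delta)
{
    int b = key % NBUCKETS;               /* only this bucket is locked */
    pthread_mutex_lock(&amp;bucket_lock[b]);
    table[b] += delta;
    pthread_mutex_unlock(&amp;bucket_lock[b]);
}

int main(void)
{
    table_init();
    table_add(3, 1);
    table_add(19, 1);                     /* 19 also maps to bucket 3 */
    printf("bucket 3 holds %d\n", table[3]);
    return 0;
}
&lt;/pre&gt;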
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required only a few modest changes to remove bottlenecks. The MIT team speculates that this trend will continue as the number of cores increases. They also state that things not bottlenecked by the CPU are harder to fix.&lt;br /&gt;
&lt;br /&gt;
Most of the kernel bottlenecks that the applications hit most often can be eliminated with minor changes. Most of the changes used well-known techniques, with the exception of sloppy counters. The study is limited by its removal of the IO bottleneck, but it does suggest that traditional implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of its content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; kernels are given the same test suite, so no advantage is given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround for another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where the application can process them. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, because multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The patched kernel&#039;s network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also deliberately set up to avoid bottlenecks imposed by the network and file storage hardware, so making the proposed kernel modifications will not necessarily produce the same increase in throughput as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake produced essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all of the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair if and only if the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper; it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice more to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system the time spent in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6437</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6437"/>
		<updated>2010-12-02T17:37:52Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts==&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. A single instance of memcached running across many cores is bottlenecked by an internal lock, which the MIT team avoided by running one instance per core. Each client connects to a single instance of memcached, allowing the server to simulate parallelism without needing major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.[1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server that has been used in previous Linux scalability studies. In this study, Apache is configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the others process them. On a single-core processor, Apache spends 60% of its execution time in the kernel.[1]&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves a great deal of reading and writing, the test cases use the in-memory filesystem tmpfs to prevent bottlenecks caused by the filesystem or storage hardware, giving them a way around those bottlenecks for testing purposes. In addition, gmake is limited in its scalability, to a small degree, by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. The question is whether a standard Linux kernel and standard user-level applications can be made to scale on a 48-core system[1]. The problem is that a standard Linux system is not designed for massive scalability, and this will soon matter. The scalability issue is that a single core running alone performs much more work than the same core does when working alongside 47 others. At first that seems reasonable, since 48 cores are dividing the work, but the main goal when processing a workload is to finish as quickly as possible, so every core should be doing as much work as possible.&lt;br /&gt;
&lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and recent releases have already begun to add scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs do not yet take advantage of those features. The final aspect is how an application uses kernel services: resources should be shared in a way that keeps different parts of the program from contending over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on the scalability of UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X, and Windows. Linux in particular has gained kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that limit scalability.[3] There is also an excellent base of earlier Linux scalability studies on which to build this paper, including one on scalability on a 32-core machine.[4] The authors can improve their results by learning from the experiments already performed by other researchers, and that prior work also helps identify bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The percentage of serialization in a program has a lot to do with how much an application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup). (I can&#039;t get this to sound right, someone fix it please -[[Rannath]] &amp;lt;- I will fix [[Daniel B.]])&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:	 &lt;br /&gt;
**Locking of shared data structure as the number of cores increase leads to an increase in lock wait time	 &lt;br /&gt;
**Writing to shared memory as the number of cores increase leads to an increase in the execution time of the cache coherence protocol	 &lt;br /&gt;
**Competing for space in shared hardware cache as the number of cores increase leads to an increase in cache miss rate	 &lt;br /&gt;
**Competing for shared hardware resources as the number of cores increase leads to time lost waiting for resources	 &lt;br /&gt;
**Not enough tasks for cores leads to idle cores&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing scheme requires packets to travel through several queues before they finally become available for the application to use. This works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and can even direct packet flow to the core on which the application is running, using Receive Flow Steering[2], for better performance. Linux also tries to improve performance with a sampling technique that checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections, like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this scheme performs poorly when there are many open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it is better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in the article uses multiple hardware queues (which can also be driven by Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core; if the current core&#039;s queue is found to be empty, it will attempt to obtain work from the queues of other cores. This configuration is ideal for many short connections, since all the work for a connection is completed quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
[1] J. Corbet. Receive Packet Steering, November 2009. http://lwn.net/Articles/362339/.&lt;br /&gt;
[2] J. Edge. Receive Flow Steering, April 2010. http://lwn.net/Articles/382428/.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications under test referenced and updated counters shared across multiple cores. The solution in the paper is to use sloppy counters, which let each core track its own separate count of references while a central shared counter keeps the overall count on track. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. Their main disadvantages are that object de-allocation becomes expensive in workloads where it occurs often, and that the counters use space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each was split into per-core versions of itself. In the case of the vfsmount table the central data structure was kept, and on a per-core miss the entry is copied from the central table into the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Variables placed together on the same cache line cause different cores to request that line for reading and writing at the same time, often enough to significantly impact performance. By moving the often-written variable to another cache line, the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex that protects a whole data structure can be split into finer-grained locks that each protect only part of it. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot; [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required only a few modest changes to remove bottlenecks. The MIT team speculates that this trend will continue as the number of cores increases. They also state that things not bottlenecked by the CPU are harder to fix.&lt;br /&gt;
&lt;br /&gt;
Most of the kernel bottlenecks that the applications hit most often can be eliminated with minor changes. Most of the changes used well-known techniques, with the exception of sloppy counters. The study is limited by its removal of the IO bottleneck, but it does suggest that traditional implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of its content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; kernels are given the same test suite, so no advantage is given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround for another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where the application can process them. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, because multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The patched kernel&#039;s network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also deliberately set up to avoid bottlenecks imposed by the network and file storage hardware, so making the proposed kernel modifications will not necessarily produce the same increase in throughput as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake produced essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all of the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair if and only if the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper; it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice more to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its own code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel; on a 48-core system the time spent in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6433</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6433"/>
		<updated>2010-12-02T17:26:09Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts==&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets. [1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server. For this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming): one thread accepts incoming connections and a pool of other threads services them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves a great deal of reading and writing, the test cases use an in-memory filesystem, tmpfs, which gives them a backdoor around filesystem and storage-hardware bottlenecks for testing purposes. In addition, gmake is limited in scalability, to a small degree, by the serial phases that run at the beginning and end of its execution. gmake spends much of its execution time in its compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system (kernel) time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. The question is whether a standard Linux kernel, running standard user-level applications, can scale on a 48-core system[1]. The concern is that a standard Linux operating system is not designed for massive scalability, and this will soon be a problem. The scalability issue is that a core running alone performs much more work than the same core does when working alongside 47 other cores. By traditional logic that situation makes sense, because 48 cores are dividing the work; but since the main goal of processing is to finish as soon as possible, every core should be doing as much work as possible.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and it has the advantage that recent releases are already beginning to implement scalability features. At the user level, applications can be improved so that there is more focus on parallelism, since some programs have not yet adopted those improvements. The final aspect of improving scalability is how an application uses kernel services: resources should be shared in such a way that different parts of the program do not conflict over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on scalability for UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques which can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux in particular has been improved with kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that lower scalability.[3] There is also an excellent base of prior Linux scalability studies on which to base this paper, including one on scalability on a 32-core machine.[4] That earlier work improves the present results, because lessons can be drawn from experiments already performed by other researchers, and it also aids in identifying bottlenecks, which speeds up the search for solutions to those bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The fraction of a program that must execute serially largely determines how much the application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel: with serial fraction s and n cores, the speedup is at most 1 / (s + (1 - s)/n), which approaches 1/s as n grows (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup; see the sketch after this list). (I can&#039;t get this to sound right, someone fix it please -[[Rannath]]) &amp;lt;- I will fix it [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, lock wait time increases&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, so does the execution time of the cache coherence protocol&lt;br /&gt;
**Competing for space in a shared hardware cache: as the number of cores increases, the cache miss rate increases&lt;br /&gt;
**Competing for other shared hardware resources: as the number of cores increases, more time is lost waiting for those resources&lt;br /&gt;
**Not enough tasks for the cores, leaving cores idle&lt;br /&gt;
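&lt;br /&gt;
As a rough illustration of the Amdahl&#039;s Law bullet above, here is a minimal C sketch (ours, not from the paper) that computes the speedup bound for a given serial fraction and core count; the 25% / 48-core numbers are example inputs only.&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 /* Amdahl&#039;s Law: with serial fraction s and n cores, the best possible&lt;br /&gt;
  * speedup is 1 / (s + (1 - s) / n), which tends to 1/s as n grows. */&lt;br /&gt;
 static double amdahl(double s, double n)&lt;br /&gt;
 {&lt;br /&gt;
     return 1.0 / (s + (1.0 - s) / n);&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     double s = 0.25;                      /* 25% of the work is serial */&lt;br /&gt;
     printf(&amp;quot;48 cores: %.2fx speedup\n&amp;quot;, amdahl(s, 48));&lt;br /&gt;
     printf(&amp;quot;limit:    %.2fx speedup\n&amp;quot;, 1.0 / s);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;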
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing technique requires packets to travel through several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and it can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique in which it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it is better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core. If the current core&#039;s queue is found to be empty, it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for a connection is accomplished quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
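&lt;br /&gt;
The idea of pinning a connection to one core can be shown with a tiny sketch (ours, not the kernel&#039;s code): a hypothetical connection 4-tuple is hashed to a core index, so every packet of the same connection lands in the same core&#039;s queue.&lt;br /&gt;
 #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 /* Hypothetical connection identifier: the classic 4-tuple. */&lt;br /&gt;
 struct conn {&lt;br /&gt;
     uint32_t src_ip, dst_ip;&lt;br /&gt;
     uint16_t src_port, dst_port;&lt;br /&gt;
 };&lt;br /&gt;
 /* Map a connection to one of ncores queues.  Every packet of the same&lt;br /&gt;
  * connection hashes to the same value, so it is always handled by the&lt;br /&gt;
  * same core - the property the modified setup relies on. */&lt;br /&gt;
 static unsigned pick_core(const struct conn *c, unsigned ncores)&lt;br /&gt;
 {&lt;br /&gt;
     uint32_t h = c-&amp;gt;src_ip ^ c-&amp;gt;dst_ip ^&lt;br /&gt;
                  ((uint32_t)c-&amp;gt;src_port &amp;lt;&amp;lt; 16 | c-&amp;gt;dst_port);&lt;br /&gt;
     h ^= h &amp;gt;&amp;gt; 16;                    /* mix the bits a little */&lt;br /&gt;
     return h % ncores;&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     struct conn c = { 0x0a000001, 0x0a000002, 54321, 80 };&lt;br /&gt;
     printf(&amp;quot;connection goes to core %u of 48\n&amp;quot;, pick_core(&amp;amp;c, 48));&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;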
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications being tested were referencing and updating shared counters from multiple cores. The solution in the paper is to use sloppy counters: each core tracks its own separate count of references, and a central shared counter keeps the overall count on track. This is ideal because each core updates its counts by modifying its per-core counter, usually needing access only to its own local cache, which cuts down on waiting for locks and on serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. The main disadvantages of sloppy counters are that de-allocating an object becomes expensive in situations where de-allocation occurs often, since the spare per-core references have to be gathered back up, and that the counters use space proportional to the number of cores.&lt;br /&gt;
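&lt;br /&gt;
The general idea is easier to see in a toy C sketch (our simplification, not the paper&#039;s implementation): each core increments a private counter and only occasionally spills into the shared one, and an exact read sums everything. A real kernel would place each per-core slot on its own cache line and protect the central count with a lock or atomic operation.&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define NCORES   48&lt;br /&gt;
 #define SPILL_AT 64   /* how much a core may accumulate locally */&lt;br /&gt;
 static long central;&lt;br /&gt;
 static long local_count[NCORES];&lt;br /&gt;
 static void sloppy_inc(int core)&lt;br /&gt;
 {&lt;br /&gt;
     if (++local_count[core] &amp;gt;= SPILL_AT) {  /* mostly touches local data */&lt;br /&gt;
         central += local_count[core];        /* rare: move batch to center */&lt;br /&gt;
         local_count[core] = 0;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 static long sloppy_read(void)    /* an exact total must sum every core */&lt;br /&gt;
 {&lt;br /&gt;
     long total = central;&lt;br /&gt;
     for (int i = 0; i &amp;lt; NCORES; i++)&lt;br /&gt;
         total += local_count[i];&lt;br /&gt;
     return total;&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     for (int i = 0; i &amp;lt; 1000; i++)&lt;br /&gt;
         sloppy_inc(i % NCORES);&lt;br /&gt;
     printf(&amp;quot;count = %ld\n&amp;quot;, sloppy_read());&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;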
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses were filled from the central table into the per-core table.&lt;br /&gt;
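&lt;br /&gt;
A rough C sketch of the per-core pattern (ours; a real kernel would protect the central list with a lock): allocate from and free to a per-core free list, falling back to the central list only when the local one is empty.&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NCORES 48&lt;br /&gt;
 struct obj { struct obj *next; };&lt;br /&gt;
 /* One free list per core plus a central fallback list. */&lt;br /&gt;
 static struct obj *percore_free[NCORES];&lt;br /&gt;
 static struct obj *central_free;&lt;br /&gt;
 static struct obj *obj_alloc(int core)&lt;br /&gt;
 {&lt;br /&gt;
     struct obj *o = percore_free[core];&lt;br /&gt;
     if (o) {                        /* common case: no sharing at all */&lt;br /&gt;
         percore_free[core] = o-&amp;gt;next;&lt;br /&gt;
         return o;&lt;br /&gt;
     }&lt;br /&gt;
     o = central_free;               /* rare case: refill from the center */&lt;br /&gt;
     if (o) {&lt;br /&gt;
         central_free = o-&amp;gt;next;&lt;br /&gt;
         return o;&lt;br /&gt;
     }&lt;br /&gt;
     return malloc(sizeof(*o));&lt;br /&gt;
 }&lt;br /&gt;
 static void obj_free(int core, struct obj *o)&lt;br /&gt;
 {&lt;br /&gt;
     o-&amp;gt;next = percore_free[core];  /* always hand back locally */&lt;br /&gt;
     percore_free[core] = o;&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     struct obj *o = obj_alloc(0);&lt;br /&gt;
     obj_free(0, o);&lt;br /&gt;
     printf(&amp;quot;allocated and freed on core 0\n&amp;quot;);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;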
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Variables that end up on the same cache line can cause different cores to request that line for reading and writing at the same time, often enough to significantly impact performance. By moving the often-written variable to another cache line, the bottleneck was removed.&lt;br /&gt;
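&lt;br /&gt;
The fix amounts to making sure that two variables written by different cores do not share a cache line. A minimal sketch of the idea (GCC/Clang alignment syntax is assumed; 64 bytes is a typical x86 cache-line size):&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define CACHE_LINE 64&lt;br /&gt;
 /* Without padding, a and b could land on the same cache line, so a core&lt;br /&gt;
  * writing a would keep stealing the line from a core writing b (&amp;quot;false&lt;br /&gt;
  * sharing&amp;quot;).  Aligning each counter to its own line removes that. */&lt;br /&gt;
 struct padded_counter {&lt;br /&gt;
     long value;&lt;br /&gt;
 } __attribute__((aligned(CACHE_LINE)));&lt;br /&gt;
 static struct padded_counter a, b;   /* now guaranteed on separate lines */&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     printf(&amp;quot;a at %p, b at %p\n&amp;quot;, (void *)&amp;amp;a, (void *)&amp;amp;b);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;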
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex that locks a whole data structure can be split into mutexes that each lock only part of it (a sketch of this follows below). Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
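&lt;br /&gt;
One common form of the second change is replacing a single lock over a whole hash table with one lock per bucket, so operations on different buckets no longer serialize. A small pthreads sketch of that pattern (ours, not taken from the paper):&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define NBUCKETS 128&lt;br /&gt;
 /* One mutex per bucket instead of one mutex for the whole table:&lt;br /&gt;
  * threads touching different buckets no longer contend. */&lt;br /&gt;
 static pthread_mutex_t bucket_lock[NBUCKETS];&lt;br /&gt;
 static long bucket_count[NBUCKETS];&lt;br /&gt;
 static void table_add(unsigned key)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned b = key % NBUCKETS;&lt;br /&gt;
     pthread_mutex_lock(&amp;amp;bucket_lock[b]);&lt;br /&gt;
     bucket_count[b]++;                 /* stand-in for real bucket work */&lt;br /&gt;
     pthread_mutex_unlock(&amp;amp;bucket_lock[b]);&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     for (int i = 0; i &amp;lt; NBUCKETS; i++)&lt;br /&gt;
         pthread_mutex_init(&amp;amp;bucket_lock[i], NULL);&lt;br /&gt;
     table_add(42);&lt;br /&gt;
     printf(&amp;quot;bucket 42 count = %ld\n&amp;quot;, bucket_count[42]);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;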
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot;&#039;&#039; [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required a few modest changes to remove bottlenecks. The MIT team speculates that this trend will continue as the number of cores increases. They also state that things that are not bottlenecked by the CPU are harder to fix.&lt;br /&gt;
&lt;br /&gt;
We can eliminate most of the kernel bottlenecks that the applications hit most often with minor changes. Most of the changes use well-known techniques, with the exception of sloppy counters. The study is limited by its removal of the I/O bottleneck, but it does suggest that traditional implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of the content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw whereby network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, because multiple threads spread across cores are forced to absorb these mutex (mutual exclusion) costs. For the sake of this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. These tests were also set up to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed modifications to the kernel won&#039;t necessarily produce the same increase in throughput as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernels. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns when it comes to the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all of the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper, it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48-core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6432</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6432"/>
		<updated>2010-12-02T17:25:56Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Multicore packet processing: Section 4.2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*flesh out the whole lot&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Essay Conclusion (also discussion) - Everyone&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
==Paper - DONE!!!==&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
The paper: [http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
==Background Concepts==&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
===memcached: &#039;&#039;Section 3.2&#039;&#039;===&lt;br /&gt;
memcached is an in-memory hash table server. One instance of memcached running on many different cores is bottlenecked by an internal lock, which is avoided by the MIT team by running one instance per core. Clients each connect to a single instance of memcached, allowing the server to simulate parallelism without needing to make major changes to the application or kernel. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets. [1]&lt;br /&gt;
&lt;br /&gt;
===Apache: &#039;&#039;Section 3.3&#039;&#039;===&lt;br /&gt;
Apache is a web server. For this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a perfect example of parallel programming): one thread accepts incoming connections and a pool of other threads services them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
===gmake: &#039;&#039;Section 3.5&#039;&#039;===&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community and is used in this paper to build the Linux kernel. gmake reads a file called a makefile and processes its recipes for the requisite files to determine how and when to remake or recompile code. With the -j (or --jobs) option, gmake can process many of these recipes in parallel. Since gmake creates more processes than there are cores, it can make proper use of multiple cores to process the recipes.[2] Because gmake involves a great deal of reading and writing, the test cases use an in-memory filesystem, tmpfs, which gives them a backdoor around filesystem and storage-hardware bottlenecks for testing purposes. In addition, gmake is limited in scalability, to a small degree, by the serial phases that run at the beginning and end of its execution. gmake spends much of its execution time in its compiler, processing the recipes and recompiling code, but still spends 7.6% of its time in system (kernel) time.[1]&lt;br /&gt;
&lt;br /&gt;
[2] http://www.gnu.org/software/make/manual/make.html&lt;br /&gt;
&lt;br /&gt;
==Research problem==&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a main processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue. The question is whether a standard Linux kernel, running standard user-level applications, can scale on a 48-core system[1]. The concern is that a standard Linux operating system is not designed for massive scalability, and this will soon be a problem. The scalability issue is that a core running alone performs much more work than the same core does when working alongside 47 other cores. By traditional logic that situation makes sense, because 48 cores are dividing the work; but since the main goal of processing is to finish as soon as possible, every core should be doing as much work as possible.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and it has the advantage that recent releases are already beginning to implement scalability features. At the user level, applications can be improved so that there is more focus on parallelism, since some programs have not yet adopted those improvements. The final aspect of improving scalability is how an application uses kernel services: resources should be shared in such a way that different parts of the program do not conflict over the same services. All of the bottlenecks that were found actually take only a little work to avoid.[1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of earlier work on scalability for UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques which can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux in particular has been improved with kernel subsystems such as Read-Copy-Update (RCU), an algorithm used to avoid the locks and atomic instructions that lower scalability.[3] There is also an excellent base of prior Linux scalability studies on which to base this paper, including one on scalability on a 32-core machine.[4] That earlier work improves the present results, because lessons can be drawn from experiments already performed by other researchers, and it also aids in identifying bottlenecks, which speeds up the search for solutions to those bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Contribution==&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
All contributions in this paper are the result of the identification and removal or marginalization of bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===What hinders scalability: &#039;&#039;Section 4.1&#039;&#039;===&lt;br /&gt;
*The fraction of a program that must execute serially largely determines how much the application can be sped up. This is Amdahl&#039;s Law&lt;br /&gt;
** Amdahl&#039;s Law states that the maximum speedup of a parallel program is limited by the inverse of the proportion of the program that cannot be made parallel: with serial fraction s and n cores, the speedup is at most 1 / (s + (1 - s)/n), which approaches 1/s as n grows (e.g. 25% (0.25) non-parallel --&amp;gt; limit of 4x speedup). (I can&#039;t get this to sound right, someone fix it please -[[Rannath]]) &amp;lt;- I will fix it [[Daniel B.]]&lt;br /&gt;
*Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
**Locking of shared data structures: as the number of cores increases, lock wait time increases&lt;br /&gt;
**Writing to shared memory: as the number of cores increases, so does the execution time of the cache coherence protocol&lt;br /&gt;
**Competing for space in a shared hardware cache: as the number of cores increases, the cache miss rate increases&lt;br /&gt;
**Competing for other shared hardware resources: as the number of cores increases, more time is lost waiting for those resources&lt;br /&gt;
**Not enough tasks for the cores, leaving cores idle&lt;br /&gt;
&lt;br /&gt;
===Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;===&lt;br /&gt;
Linux&#039;s packet processing technique requires packets to travel through several queues before they finally become available for the application to use. This technique works well for most general socket applications. In recent kernel releases Linux takes advantage of multiple hardware queues (when available on the given network interface) or Receive Packet Steering[1] to direct packet flow onto different cores for processing, and it can even go as far as directing packet flow to the core on which the application is running, using Receive Flow Steering[2], for even better performance. Linux also attempts to increase performance using a sampling technique in which it checks every 20th outgoing packet and directs flow based on its hash. This poses a problem for short-lived connections like those associated with Apache, since there is great potential for packets to be misdirected.&lt;br /&gt;
In general this technique performs poorly when there are numerous open connections spread across multiple cores, due to mutex (mutual exclusion) delays and cache misses. In such scenarios it is better to process each connection, with its associated packets and queues, on one core to avoid those issues. The patched kernel proposed in this article uses multiple hardware queues (which can be accomplished through Receive Packet Steering) to direct all packets from a given connection to the same core. In turn, Apache is modified to accept a connection only if the thread dedicated to processing it is on the same core. If the current core&#039;s queue is found to be empty, it will attempt to obtain work from queues located on other cores. This configuration is ideal for numerous short connections, as all the work for a connection is accomplished quickly on one core, avoiding unnecessary mutex delays associated with packet queues and inter-core cache misses.&lt;br /&gt;
&lt;br /&gt;
===Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;===&lt;br /&gt;
Bottlenecks were encountered when the applications being tested were referencing and updating shared counters from multiple cores. The solution in the paper is to use sloppy counters: each core tracks its own separate count of references, and a central shared counter keeps the overall count on track. This is ideal because each core updates its counts by modifying its per-core counter, usually needing access only to its own local cache, which cuts down on waiting for locks and on serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. The main disadvantages of sloppy counters are that de-allocating an object becomes expensive in situations where de-allocation occurs often, since the spare per-core references have to be gathered back up, and that the counters use space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
===Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;===&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
===Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;===&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses were filled from the central table into the per-core table.&lt;br /&gt;
&lt;br /&gt;
===Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;===&lt;br /&gt;
Variables that end up on the same cache line can cause different cores to request that line for reading and writing at the same time, often enough to significantly impact performance. By moving the often-written variable to another cache line, the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
===Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;===&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex that locks a whole data structure can be split into mutexes that each lock only part of it. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
==Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;==&lt;br /&gt;
===Work in Progress===&lt;br /&gt;
&lt;br /&gt;
====[[Rovic P.]]====&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&amp;quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&amp;quot;&#039;&#039; [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====[[Rannath]]====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required a few modest changes to remove bottlenecks. The MIT team speculates that this trend will continue as the number of cores increases. They also state that things that are not bottlenecked by the CPU are harder to fix.&lt;br /&gt;
&lt;br /&gt;
We can eliminate most of the kernel bottlenecks that the applications hit most often with minor changes. Most of the changes use well-known techniques, with the exception of sloppy counters. The study is limited by its removal of the I/O bottleneck, but it does suggest that traditional implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
 Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of the content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
===Content(Fairness): &#039;&#039;Section 5&#039;&#039;===&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 5.3&#039;&#039;====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage I/O bottleneck does not affect it at all. Likewise the &amp;quot;stock&amp;quot; and &amp;quot;PK&amp;quot; implementations are given the same test suite, so there is no advantage given to either. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [3]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 5.4&#039;&#039;====&lt;br /&gt;
Linux has a built-in kernel flaw whereby network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, because multiple threads spread across cores are forced to absorb these mutex (mutual exclusion) costs. For the sake of this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to implement better parallel execution on a traditional kernel. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. These tests were also set up to avoid bottlenecks imposed by the network and file storage hardware, meaning that making the proposed modifications to the kernel won&#039;t necessarily produce the same increase in throughput as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Which is not a problem as the paper specifically states that they are testing what they can improve in spite of hardware limitation.&#039;&#039; - [[Rannath]]&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 5.6&#039;&#039;====&lt;br /&gt;
Since the inherent nature of gmake makes it quite parallel, the testing and updating attempted on gmake resulted in essentially the same scalability results for both the stock and modified kernels. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake relies quite heavily on the compiler that is used with it, so depending on which compiler was chosen, gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns when it comes to the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Given that all of the tests are more or less fair for the purposes of the benchmarks, they support the hypothesis that Linux can be made to scale, at least to 48 cores. Thus the conclusion is fair iff the rest of the paper is fair.&lt;br /&gt;
&lt;br /&gt;
 Now you just have to fill in how fair the rest of the paper is.&lt;br /&gt;
&lt;br /&gt;
===Style===&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
Everything seems to be in logical order. I couldn&#039;t find any needless info. Nothing inherently confusing or wrong. Nothing bad on the grammar front either. - Rannath&lt;br /&gt;
&lt;br /&gt;
Some acronyms aren&#039;t explained before they are used, so some people reading the paper may get confused as to what they mean (e.g. Linux TLB). Since this paper is meant to be formal, acronyms should be explained, with some exceptions like OS and IBM. - Daniel B.&lt;br /&gt;
&lt;br /&gt;
Your example has no impact on the paper, it was in the &amp;quot;look here for more info&amp;quot; section. Most people wouldn&#039;t know what a &amp;quot;translation look-aside buffer&amp;quot; is either.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
[3] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;the paper itself doesn&#039;t need to be referenced more than once as this is a critique of the paper...&#039;&#039;&#039;&lt;br /&gt;
[1] Silas Boyd-Wickizer et al. &amp;quot;An Analysis of Linux Scalability to Many Cores&amp;quot;. In &#039;&#039;OSDI &#039;10, 9th USENIX Symposium on OS Design and Implementation&#039;&#039;, Vancouver, BC, Canada, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
==Deprecated==&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with the addition of new cores. On a single-core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48-core system the execution time in the kernel jumps to 82%.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6340</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6340"/>
		<updated>2010-12-02T15:51:48Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. -Rannath&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note, Anil in class Tuesday the 30th of November said that we only need to explain 3 of the applications and not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
* I will work on contribution - [[Rovic]]&lt;br /&gt;
* I&#039;m gonna whip something up for 4.2 since there appears to be nothing mentioned about it. -kirill&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* So here is the claims and unclaimed section. Add your name next to one if you want to take it on.&lt;br /&gt;
** gmake - Daniel B.&lt;br /&gt;
** memcached - Rannath&lt;br /&gt;
** Apache - Kirill&lt;br /&gt;
** [[(Exim, PostgreSQL, Metis, and Psearchy will not be needed as the professor said we only need to explain 3)]]&lt;br /&gt;
** Research Problem - Andrew&lt;br /&gt;
** Contribution - Rovic&lt;br /&gt;
** Critic, Style - Everyone&lt;br /&gt;
** Conclusion (also discussion) - Rannath, but I need someone to help flesh it out, I got the salient points down.&lt;br /&gt;
** References - Everyone&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where they do their processing)&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table server. One instance running on many cores is bottlenecked by an internal lock. The MIT team ran multiple instances to avoid the problem. Clients each connect to a single instance. This allows the server to simulate parallelism. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the remaining threads service them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community; in this paper it is used to build the Linux kernel. gmake is already quite parallel, creating more processes than there are cores so that it can make proper use of all of them, and it reads and writes many files because it is building the kernel. Its scalability is limited by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, but still spends 7.6% of its time in system time. [1]&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue, so there has to be a way for a standard Linux kernel to scale on a 48-core system [1]. The problem is that a standard Linux system is not designed for massive scalability, and this will soon matter. The symptom is that a core running alone performs much more work than a single core working alongside 47 others. By traditional logic that seems reasonable, since 48 cores are dividing the work; but the real goal is to finish the work as quickly as possible, so every core should be doing as much work as it can.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and recent kernel releases have already begun to add scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs have not yet adopted these improvements. The final aspect is how an application uses kernel services: resources should be shared so that different parts of the program are not contending for the same services. All of the bottlenecks that were found actually take only a little work to avoid. [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of prior work on scalability for UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update, an algorithm used to avoid locks and atomic instructions that lower scalability [3]. There is also an excellent base of earlier Linux scalability studies on which to build, including one on scalability of a 32-core machine [4]. That prior work improves the results here by letting the authors learn from experiments already performed, and it also aids in identifying bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Section 4.1 problems:==&lt;br /&gt;
**The percentage of serialization in a program largely determines how much an application can be sped up. As in the example from the paper, this follows Amdahl&#039;s law (e.g. 25% serialization --&gt; limit of 4x speedup); a worked version of the formula is given after this list.&lt;br /&gt;
**Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
***Locking of shared data structure - increasing # of cores --&amp;gt; increase in lock wait time&lt;br /&gt;
***Writing to shared memory - increasing # of cores --&amp;gt; increase in wait for cache coherence protocol&lt;br /&gt;
***Competing for space in shared hardware cache - increasing # of cores --&amp;gt; increase in cache miss rate&lt;br /&gt;
***Competing for shared hardware resources - increasing # of cores --&amp;gt; increase in wait for resources&lt;br /&gt;
***Not enough tasks for cores --&amp;gt; idle cores&lt;br /&gt;
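&lt;br /&gt;
As a worked version of the speedup limit in the first bullet above (our notation, not the paper&#039;s), Amdahl&#039;s law for a serial fraction s on N cores is:&lt;br /&gt;
 S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s}&lt;br /&gt;
 % with s = 0.25 (25% serialization), the limit is 1/0.25 = 4, i.e. at most a 4x speedup&lt;br /&gt;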
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
-==Work in Progress==-- -Rovic P.&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
Quoted from the paper&#039;s conclusion: &quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications under test were referencing and updating shared counters from multiple cores. The solution in the paper is to use sloppy counters, which have each core track its own separate count of references and use a central shared counter to keep the overall count consistent. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. The main disadvantages of sloppy counters are that they perform poorly when objects are de-allocated often, since de-allocation requires reconciling the per-core counts and is an expensive operation, and that the counters use space proportional to the number of cores.&lt;br /&gt;
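&lt;br /&gt;
Below is a minimal user-space sketch of the sloppy-counter idea in C (the names, batch size, and reconciliation policy are our own illustration, not the kernel&#039;s actual code):&lt;br /&gt;
 #include &lt;stdatomic.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NCORES 48&lt;br /&gt;
 #define BATCH  16   /* spare references a core may hoard locally */&lt;br /&gt;
 &lt;br /&gt;
 struct sloppy_counter {&lt;br /&gt;
     atomic_long central;   /* shared count; only touched when a core runs out of spares */&lt;br /&gt;
     long local[NCORES];    /* spare references held per core (real code pads these to cache lines) */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Acquire one reference on core c: usually touches only this core&#039;s slot. */&lt;br /&gt;
 static void sc_get(struct sloppy_counter *sc, int c)&lt;br /&gt;
 {&lt;br /&gt;
     if (sc-&gt;local[c] &gt; 0) {&lt;br /&gt;
         sc-&gt;local[c]--;&lt;br /&gt;
     } else {&lt;br /&gt;
         atomic_fetch_add(&amp;sc-&gt;central, BATCH);   /* rare: grab a batch from the shared counter */&lt;br /&gt;
         sc-&gt;local[c] = BATCH - 1;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Release one reference on core c; return excess spares to the shared counter. */&lt;br /&gt;
 static void sc_put(struct sloppy_counter *sc, int c)&lt;br /&gt;
 {&lt;br /&gt;
     sc-&gt;local[c]++;&lt;br /&gt;
     if (sc-&gt;local[c] &gt; 2 * BATCH) {&lt;br /&gt;
         atomic_fetch_sub(&amp;sc-&gt;central, BATCH);&lt;br /&gt;
         sc-&gt;local[c] -= BATCH;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The exact count (central minus the sum of the local spares) only has to be computed when the object is freed, which is why frequent de-allocation is the expensive case.&lt;br /&gt;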
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks: a per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was kept, and any per-core misses were filled into the per-core table from the central one. A rough sketch of the per-core free-list idea is given below.&lt;br /&gt;
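&lt;br /&gt;
A rough C sketch of a per-core free list (illustrative only; the helper for the slow path is hypothetical and not the kernel&#039;s actual packet-buffer code):&lt;br /&gt;
 struct buf { struct buf *next; /* ... payload ... */ };&lt;br /&gt;
 &lt;br /&gt;
 struct percore_freelist {&lt;br /&gt;
     struct buf *head;   /* only ever touched by its owning core, so no lock is needed */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 static struct percore_freelist freelist[48];&lt;br /&gt;
 &lt;br /&gt;
 struct buf *refill_from_central_pool(int c);   /* hypothetical locked slow path, not shown */&lt;br /&gt;
 &lt;br /&gt;
 /* Allocate on core c: take from the local list; fall back to the central pool only when empty. */&lt;br /&gt;
 struct buf *buf_alloc(int c)&lt;br /&gt;
 {&lt;br /&gt;
     struct buf *b = freelist[c].head;&lt;br /&gt;
     if (b)&lt;br /&gt;
         freelist[c].head = b-&gt;next;&lt;br /&gt;
     else&lt;br /&gt;
         b = refill_from_central_pool(c);&lt;br /&gt;
     return b;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Free on core c: push back onto the local list. */&lt;br /&gt;
 void buf_free(struct buf *b, int c)&lt;br /&gt;
 {&lt;br /&gt;
     b-&gt;next = freelist[c].head;&lt;br /&gt;
     freelist[c].head = b;&lt;br /&gt;
 }&lt;br /&gt;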
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables in memory caused different cores to read and write the same cache line at the same time often enough to significantly impact performance. By moving the often-written variable to another cache line the bottleneck was removed; the sketch below illustrates the idea.&lt;br /&gt;
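&lt;br /&gt;
A small C illustration of the problem and the fix (our own example, not the actual kernel variables):&lt;br /&gt;
 /* Bad: two counters written by different cores share one 64-byte cache line,&lt;br /&gt;
    so every write by one core invalidates the other core&#039;s cached copy. */&lt;br /&gt;
 struct stats_bad {&lt;br /&gt;
     long written_by_core0;&lt;br /&gt;
     long written_by_core1;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 /* Fix: pad (or move) the hot variable so each one sits on its own cache line. */&lt;br /&gt;
 struct stats_good {&lt;br /&gt;
     long written_by_core0;&lt;br /&gt;
     char pad[64 - sizeof(long)];   /* assumes a 64-byte cache line */&lt;br /&gt;
     long written_by_core1;&lt;br /&gt;
 };&lt;br /&gt;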
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex protecting an entire data structure can be split into several mutexes that each protect part of it. Both of these changes remove or reduce bottlenecks; a per-bucket locking sketch follows below.&lt;br /&gt;
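&lt;br /&gt;
As an illustration of splitting one lock into many, here is a generic hash-table example of ours in C (not code from the paper):&lt;br /&gt;
 #include &lt;pthread.h&gt;&lt;br /&gt;
 &lt;br /&gt;
 #define NBUCKETS 1024&lt;br /&gt;
 &lt;br /&gt;
 struct entry { struct entry *next; long key; };&lt;br /&gt;
 &lt;br /&gt;
 struct bucket {&lt;br /&gt;
     pthread_mutex_t lock;   /* protects only this bucket&#039;s chain, not the whole table;&lt;br /&gt;
                                each lock must be initialised with pthread_mutex_init() */&lt;br /&gt;
     struct entry *head;&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 struct table { struct bucket buckets[NBUCKETS]; };&lt;br /&gt;
 &lt;br /&gt;
 void table_insert(struct table *t, struct entry *e)&lt;br /&gt;
 {&lt;br /&gt;
     struct bucket *b = &amp;t-&gt;buckets[(unsigned long)e-&gt;key % NBUCKETS];&lt;br /&gt;
     pthread_mutex_lock(&amp;b-&gt;lock);   /* cores inserting into different buckets no longer contend */&lt;br /&gt;
     e-&gt;next = b-&gt;head;&lt;br /&gt;
     b-&gt;head = e;&lt;br /&gt;
     pthread_mutex_unlock(&amp;b-&gt;lock);&lt;br /&gt;
 }&lt;br /&gt;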
&lt;br /&gt;
====Conclusion: &#039;&#039;Sections 6 &amp;amp; 7&#039;&#039;====&lt;br /&gt;
Everything so far indicates that the MOSBENCH applications can scale to 48 cores. This scaling required only a few modest changes to remove bottlenecks. The MIT team speculates that the trend will continue as the number of cores increases. They also state that bottlenecks not caused by the CPU are harder to fix.&lt;br /&gt;
&lt;br /&gt;
Most of the kernel bottlenecks that the applications hit most often can be eliminated with minor changes. Most of the changes used well-known techniques, the exception being sloppy counters. The study is limited by its removal of the IO bottleneck, but it does suggest that traditional kernel implementations can be made scalable.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &quot;my implementation is better than your implementation&quot; paper, the &quot;goodness&quot; of the content can be impartially judged by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &quot;stock&quot; and &quot;PK&quot; implementations are given the same test suite, so neither is given an advantage. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [1]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
[1] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, since multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The patched kernel&#039;s implementation of the network stack is also specific to the problem at hand, which is processing many short-lived connections across multiple cores. Although this provides a performance increase in the given scenario, network performance might suffer in more general applications. The tests were also set up to avoid bottlenecks imposed by network and file storage hardware, meaning that making the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput as described in the article. This is very evident in the test where performance degrades past 36 cores due to limitations of the networking hardware. &#039;&#039;Which is not a problem, as the paper specifically states that there are hardware limitations.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
Since gmake is inherently quite parallel, the testing and updating attempted on it resulted in essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler used with it, so depending on which compiler was chosen, gmake could run worse or slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong? or use bad methodology?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
[1] http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf&lt;br /&gt;
&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;br /&gt;
&lt;br /&gt;
===Deprecated===&lt;br /&gt;
====Background Concepts====&lt;br /&gt;
* Exim: &#039;&#039;Section 3.1&#039;&#039;: &lt;br /&gt;
**Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
* PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;: &lt;br /&gt;
**As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
* Psearchy: &#039;&#039;Section 3.6&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Metis: &#039;&#039;Section 3.7&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6320</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6320"/>
		<updated>2010-12-02T15:26:52Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Claim Sections */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
-- Go to Wireless Lab next to CCSS Lounge. Andrew and Dan B. will be there.&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want. &amp;lt;- who is this?&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* Also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note, Anil said in class on Tuesday the 30th of November that we only need to explain 3 of the applications and not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
* I will work on contribution - [[Rovic]]&lt;br /&gt;
* I&#039;m gonna whip something up for 4.2 since there appears to be nothing mentioned about it. -kirill&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where they do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table server. One instance running on many cores is bottlenecked by an internal lock. The MIT team ran multiple instances to avoid the problem. Clients each connect to a single instance. This allows the server to simulate parallelism. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the remaining threads service them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community; in this paper it is used to build the Linux kernel. gmake is already quite parallel, creating more processes than there are cores so that it can make proper use of all of them, and it reads and writes many files because it is building the kernel. Its scalability is limited by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, but still spends 7.6% of its time in system time.&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue, so there has to be a way for a standard Linux kernel to scale on a 48-core system [1]. The problem is that a standard Linux system is not designed for massive scalability, and this will soon matter. The symptom is that a core running alone performs much more work than a single core working alongside 47 others. By traditional logic that seems reasonable, since 48 cores are dividing the work; but the real goal is to finish the work as quickly as possible, so every core should be doing as much work as it can.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and recent kernel releases have already begun to add scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs have not yet adopted these improvements. The final aspect is how an application uses kernel services: resources should be shared so that different parts of the program are not contending for the same services. All of the bottlenecks that were found actually take only a little work to avoid. [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of prior work on scalability for UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update, an algorithm used to avoid locks and atomic instructions that lower scalability [3]. There is also an excellent base of earlier Linux scalability studies on which to build, including one on scalability of a 32-core machine [4]. That prior work improves the results here by letting the authors learn from experiments already performed, and it also aids in identifying bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Section 4.1 problems:==&lt;br /&gt;
**The percentage of serialization in a program largely determines how much an application can be sped up. As in the example from the paper, this follows Amdahl&#039;s law (e.g. 25% serialization --&gt; limit of 4x speedup).&lt;br /&gt;
**Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
***Locking of shared data structure - increasing # of cores --&amp;gt; increase in lock wait time&lt;br /&gt;
***Writing to shared memory - increasing # of cores --&amp;gt; increase in wait for cache coherence protocol&lt;br /&gt;
***Competing for space in shared hardware cache - increasing # of cores --&amp;gt; increase in cache miss rate&lt;br /&gt;
***Competing for shared hardware resources - increasing # of cores --&amp;gt; increase in wait for resources&lt;br /&gt;
***Not enough tasks for cores --&amp;gt; idle cores&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
-==Work in Progress==-- -Rovic P.&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
Quoted from the paper&#039;s conclusion: &quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications under test were referencing and updating shared counters from multiple cores. The solution in the paper is to use sloppy counters, which have each core track its own separate count of references and use a central shared counter to keep the overall count consistent. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. The main disadvantages of sloppy counters are that they perform poorly when objects are de-allocated often, since de-allocation requires reconciling the per-core counts and is an expensive operation, and that the counters use space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
This section describes a specific instance of unnecessary locking.&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks: a per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was kept, and any per-core misses were filled into the per-core table from the central one.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables in memory caused different cores to read and write the same cache line at the same time often enough to significantly impact performance. By moving the often-written variable to another cache line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex protecting an entire data structure can be split into several mutexes that each protect part of it. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Conclusion====&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale better (at least to 48 cores) by removing bottlenecks, but hardware will still be a limiting factor to performance.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &quot;my implementation is better than your implementation&quot; paper, the &quot;goodness&quot; of the content can be impartially judged by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &quot;stock&quot; and &quot;PK&quot; implementations are given the same test suite, so neither is given an advantage. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [1]. In the end memcached was bottlenecked by the network card.&lt;br /&gt;
&lt;br /&gt;
[1] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking. The flaw inherently diminishes the performance of Apache on a multi-core system, since multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The tests were also set up to avoid bottlenecks imposed by network and file storage hardware, meaning that making the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput as described in the article. This is very evident in the test where performance degrades past 36 cores due to limitations of the networking hardware. &#039;&#039;Which is not a problem, as the paper specifically states that there are hardware limitations.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
Since gmake is inherently quite parallel, the testing and updating attempted on it resulted in essentially the same scalability results for both the stock and the modified kernel. The only change found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, the execution time of gmake depends quite heavily on the compiler used with it, so depending on which compiler was chosen, gmake could run worse or slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing? Wrong? or use bad methodology?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6310</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6310"/>
		<updated>2010-12-02T15:05:08Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Class and Notices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 6:30pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
- [[I suggest we meet up Thursday morning after Operating Systems in order to discuss and finalize the essay. Maybe we can even designate a lab for the group to meet up in. Any suggestions?]] - Daniel B.&lt;br /&gt;
&lt;br /&gt;
- HP 3115 since there wont be a class in there (as its our tutorial and we know there won&#039;t be anyone there)&lt;br /&gt;
&lt;br /&gt;
- If its all the same to you guys mind if I just join you via msn or iirc? Or phone if you really want.&lt;br /&gt;
&lt;br /&gt;
- I&#039;m working today, but I&#039;ll be at a computer reading this page/contributing to my section. Depending on how busy I am, I should be able to get some significant writing in before 4pm today on my section and any additional sections required. RP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
- I wont be there either. that does not mean i wont/cant contribute. I&#039;ll be on msn or you can just email me. -kirill&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* Also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note, Anil said in class on Tuesday the 30th of November that we only need to explain 3 of the applications and not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
* I will work on contribution - [[Rovic]]&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where they do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
====memcached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table server. One instance running on many cores is bottlenecked by an internal lock. The MIT team ran multiple instances to avoid the problem. Clients each connect to a single instance. This allows the server to simulate parallelism. With few requests, memcached spends 80% of its time in the kernel on one core, mostly processing packets.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the remaining threads service them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
gmake is an unofficial default benchmark in the Linux community; in this paper it is used to build the Linux kernel. gmake is already quite parallel, creating more processes than there are cores so that it can make proper use of all of them, and it reads and writes many files because it is building the kernel. Its scalability is limited by the serial processes that run at the beginning and end of its execution. gmake spends most of its execution time in the compiler, but still spends 7.6% of its time in system time.&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
  my references are just below because it is easier for numbering the data later.&lt;br /&gt;
&lt;br /&gt;
As technology progresses, the number of cores a processor can have is increasing at an impressive rate. Soon personal computers will have so many cores that scalability will be an issue, so there has to be a way for a standard Linux kernel to scale on a 48-core system [1]. The problem is that a standard Linux system is not designed for massive scalability, and this will soon matter. The symptom is that a core running alone performs much more work than a single core working alongside 47 others. By traditional logic that seems reasonable, since 48 cores are dividing the work; but the real goal is to finish the work as quickly as possible, so every core should be doing as much work as it can.&lt;br /&gt;
  &lt;br /&gt;
To fix these scalability issues it is necessary to focus on three major areas: the Linux kernel, user-level application design, and how applications use kernel services. The Linux kernel can be improved to share data more efficiently, and recent kernel releases have already begun to add scalability features. At the user level, applications can be redesigned with more focus on parallelism, since some programs have not yet adopted these improvements. The final aspect is how an application uses kernel services: resources should be shared so that different parts of the program are not contending for the same services. All of the bottlenecks that were found actually take only a little work to avoid. [1]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This research builds on a large body of prior work on scalability for UNIX systems. Major developments, from shared-memory machines [2] and wait-free synchronization to fast message passing, have created a base set of techniques that can be used to improve scalability. These techniques have been incorporated into all major operating systems, including Linux, Mac OS X and Windows. Linux has been improved with kernel subsystems such as Read-Copy-Update, an algorithm used to avoid locks and atomic instructions that lower scalability [3]. There is also an excellent base of earlier Linux scalability studies on which to build, including one on scalability of a 32-core machine [4]. That prior work improves the results here by letting the authors learn from experiments already performed, and it also aids in identifying bottlenecks, which speeds up finding solutions for them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[2] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH multiprocessor. In Proc. of the 21st ISCA, pages 302–313,1994.&lt;br /&gt;
&lt;br /&gt;
[3] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy-update.  In Proceedings of the Linux Symposium 2002, pages 338-367, Ottawa Ontario, June 2002&lt;br /&gt;
&lt;br /&gt;
[4] C. Yan, Y. Chen, and S. Yuanchun. OSMark: A benchmark suite for understanding parallel scalability of operating systems on large scale multi-cores. In 2009 2nd International Conference on Computer Science and Information Technology, pages 313–317, 2009&lt;br /&gt;
&lt;br /&gt;
==Section 4.1 problems:==&lt;br /&gt;
**The percentage of serialization in a program largely determines how much an application can be sped up. As in the example from the paper, this follows Amdahl&#039;s law (e.g. 25% serialization --&gt; limit of 4x speedup).&lt;br /&gt;
**Types of serializing interactions found in the MOSBENCH apps:&lt;br /&gt;
***Locking of shared data structure - increasing # of cores --&amp;gt; increase in lock wait time&lt;br /&gt;
***Writing to shared memory - increasing # of cores --&amp;gt; increase in wait for cache coherence protocol&lt;br /&gt;
***Competing for space in shared hardware cache - increasing # of cores --&amp;gt; increase in cache miss rate&lt;br /&gt;
***Competing for shared hardware resources - increasing # of cores --&amp;gt; increase in wait for resources&lt;br /&gt;
***Not enough tasks for cores --&amp;gt; idle cores&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
  &lt;br /&gt;
 - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
-I&#039;m just using this as a notepad, do not copy/paste this section, I will put in a properly written set of paragraphs which will fit with the contribution questions asked. -RP&lt;br /&gt;
&lt;br /&gt;
-==Work in Progress==-- -Rovic P.&lt;br /&gt;
This research contributes by evaluating the scalability discrepancies of applications programming and kernel programming. Key discoveries in this research show the effectiveness of the kernel in handling scaling amongst CPU cores. This has also shown that scaling in application programming should be more the focus. It has been shown that simple scaling techniques (list techniques) such as programming parallelism (look up more stuff to back this up and quotes). (Sloppy counter effectiveness, possible positive contributions, what has been used (internet search), what hasn’t been used.) Read conclusion, 2nd paragraph.&lt;br /&gt;
&lt;br /&gt;
Quoted from the paper&#039;s conclusion: &quot;One reason the required changes are modest is that stock Linux already incorporates many modifications to improve scalability. More speculatively, perhaps it is the case that Linux’s system-call API is well suited to an implementation that avoids unnecessary contention over kernel objects.&quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications under test were referencing and updating shared counters from multiple cores. The solution in the paper is to use sloppy counters, which have each core track its own separate count of references and use a central shared counter to keep the overall count consistent. This is ideal because each core updates its count by modifying its per-core counter, usually only needing access to its own local cache, which cuts down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making them much easier to adopt. The main disadvantages of sloppy counters are that they perform poorly when objects are de-allocated often, since de-allocation requires reconciling the per-core counts and is an expensive operation, and that the counters use space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks: a per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each data structure was decentralized into per-core versions of itself. In the case of vfsmount the central data structure was kept, and any per-core misses were filled into the per-core table from the central one.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Poorly placed variables in memory caused different cores to read and write the same cache line at the same time often enough to significantly impact performance. By moving the often-written variable to another cache line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they do not actually need to lock. Likewise, a mutex protecting an entire data structure can be split into several mutexes that each protect part of it. Both of these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Conclusion====&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale better (at least to 48 cores) by removing bottlenecks, but hardware will still be a limiting factor to performance.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content;&lt;br /&gt;
 be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &quot;my implementation is better than your implementation&quot; paper, the &quot;goodness&quot; of the content can be impartially judged by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached is treated with near-perfect fairness in the paper. It&#039;s an in-memory service, so the ignored storage IO bottleneck does not affect it at all. Likewise the &quot;stock&quot; and &quot;PK&quot; implementations are given the same test suite, so neither is given an advantage. memcached itself is non-scalable, so the MIT team was forced to run one instance per core to keep up throughput. The FAQ on memcached&#039;s wiki suggests running multiple instances per server as a workaround to another problem, which implies that running multiple instances of the server is the same, or nearly the same, as running one larger server [1]. In the end memcached was bottlenecked by a flaw in the network card.&lt;br /&gt;
&lt;br /&gt;
[1] memcached&#039;s wiki: http://code.google.com/p/memcached/wiki/FAQ#Can_I_use_different_size_caches_across_servers_and_will_memcache&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core systems because multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For the sake of this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The tests were also set up to avoid the bottlenecks imposed by network and file-storage hardware, meaning that making the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput as described in the paper. This is evident in the test itself, where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
Since gmake is inherently quite parallel, the tests produced essentially the same scalability results on both the stock and the modified kernel. The only difference found was that gmake spent slightly less time at the system level because of the changes made to the system&#039;s caching. As stated in the paper, gmake&#039;s execution time depends heavily on the compiler it invokes, so depending on which compiler was chosen gmake could run worse or even slightly better. In any case, there seem to be no fairness concerns with the scalability testing of gmake, as the same application load-out was used for all of the tests.&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style criteria (feel free to add - I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing, wrong, or based on bad methodology?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just adding extra words to sound more important? - I think we should maybe limit this to run-on sentences.&lt;br /&gt;
 - check the grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;br /&gt;
&lt;br /&gt;
gmake:&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/manual/make.html gmake Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.gnu.org/software/make/ gmake Main Page]&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6012</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6012"/>
		<updated>2010-12-02T00:03:13Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 630pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note, Anil said in class on Tuesday the 30th of November that we only need to explain 3 of the applications, not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where to the do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice for each message delivered. On a single core it spends 69% of its time in the kernel.&lt;br /&gt;
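&lt;br /&gt;
To illustrate the fork-per-connection pattern, here is a generic, hypothetical sketch in C (not Exim&#039;s code; handle_connection is an assumed helper and child reaping is omitted):&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;sys/socket.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 void handle_connection(int fd);   /* assumed helper; Exim would also fork per message delivered */&lt;br /&gt;
 &lt;br /&gt;
 void serve(int listen_fd)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;) {&lt;br /&gt;
         int conn = accept(listen_fd, 0, 0);&lt;br /&gt;
         if (conn &amp;lt; 0)&lt;br /&gt;
             continue;&lt;br /&gt;
         if (fork() == 0) {        /* child: handle exactly one connection */&lt;br /&gt;
             handle_connection(conn);&lt;br /&gt;
             _exit(0);&lt;br /&gt;
         }&lt;br /&gt;
         close(conn);              /* parent keeps accepting */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;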
&lt;br /&gt;
====memcached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table (a key-value cache). memcached itself is very much not parallel, but it can be made to be: just run multiple instances and have the clients worry about dividing the data between the different instances. With few requests memcached does most of its processing in the network stack, about 80% of its time on one core.&lt;br /&gt;
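&lt;br /&gt;
A tiny, hypothetical sketch of what having the clients divide the data means in practice: every client maps a key to one of the instances with the same hash, so the instances never need to talk to each other (the hash and NSERVERS are illustrative; real clients typically use consistent hashing):&lt;br /&gt;
&lt;br /&gt;
 #define NSERVERS 4                /* e.g. one memcached instance per core */&lt;br /&gt;
 &lt;br /&gt;
 /* All clients must agree on this mapping from key to instance. */&lt;br /&gt;
 static int server_for(const char *key)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned h = 5381;&lt;br /&gt;
     for (; *key; key++)&lt;br /&gt;
         h = h * 33 + (unsigned char)*key;   /* djb2-style string hash */&lt;br /&gt;
     return (int)(h % NSERVERS);&lt;br /&gt;
 }&lt;br /&gt;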
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. For this study, Apache was configured to run a separate process on each core. Each process, in turn, has multiple threads (making it a good example of parallel programming): one thread accepts incoming connections and the other threads service them. On a single-core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name, PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces extensively to provide concurrent access to the database. Due to bottlenecks introduced both in its own code and in the kernel, the share of time PostgreSQL spends in the kernel grows very rapidly as cores are added: on a single-core system it spends only 1.5% of its time in the kernel, while on a 48-core system that jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
 Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
 Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
   - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counter by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better then your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
 Fairness criterion:&lt;br /&gt;
 - does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)&lt;br /&gt;
 - does the test put all tested implementations through the same test? (internal fairness)&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, therefore neither of them has a particular advantage. That holds true for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
Linux has a built in kernel flaw where network packets are forced to travel though multiple queues before they arrive at queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core system due to multiple threads spread across cores being forced to deal with these mutex (mutual exclusion) costs. For the sake of this experiment Apache had a separate instance on every core listening on different ports which is not a practical real world application but merely an attempt to implement better parallel execution on a traditional kernel. These tests were also rigged to avoid bottlenecks in place by network and file storage hardware. Meaning, making the proposed modifications to the kernel wont necessarily produce the same increase in productivity as described in the article. This is very much evident in the test where performance degrades past 36 cores due to limitation of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
=====PostgreSQL: &#039;&#039;Section 5.5&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Psearchy: &#039;&#039;Section 5.7&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Metis: &#039;&#039;Section 5.8&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6011</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6011"/>
		<updated>2010-12-02T00:01:18Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 5.4 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 630pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unessesary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note Anil in class Thuesday the 30th of November said that we only need to explain 3 of the applications and not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where to the do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
====memchached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table. memchached is very much not parallel, but can be made to be, just run multiple instances. Have clients worry about synchronizing data between the different instances. With few requests memcached does most of its processing at the network stack, 80% of its time on one core.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (Making it a perfect example of parallel programming). One thread to service incoming connections and various other threads to service those connections. On a single core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
 Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
 Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
   - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counter by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better then your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
 Fairness criterion:&lt;br /&gt;
 - does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)&lt;br /&gt;
 - does the test put all tested implementations through the same test? (internal fairness)&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, therefore neither of them has a particular advantage. That holds true for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
Linux has a built-in kernel flaw where network packets are forced to travel through multiple queues before they arrive at the queue where they can be processed by the application. This imposes significant costs on multi-core systems due to queue locking costs. This flaw inherently diminishes the performance of Apache on multi-core systems because multiple threads spread across cores are forced to pay these mutex (mutual exclusion) costs. For the sake of this experiment Apache ran a separate instance on every core, each listening on a different port, which is not a practical real-world configuration but merely an attempt to achieve better parallel execution on a traditional kernel. The tests were also set up to avoid the bottlenecks imposed by network and file-storage hardware, meaning that making the proposed kernel modifications won&#039;t necessarily produce the same increase in throughput as described in the paper. This is evident in the test itself, where performance degrades past 36 cores due to limitations of the networking hardware.&lt;br /&gt;
&lt;br /&gt;
=====PostgreSQL: &#039;&#039;Section 5.5&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Psearchy: &#039;&#039;Section 5.7&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Metis: &#039;&#039;Section 5.8&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6009</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=6009"/>
		<updated>2010-12-01T23:57:01Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Apache: Section 3.3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 630pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unessesary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
* Just as a note Anil in class Thuesday the 30th of November said that we only need to explain 3 of the applications and not all 7 - [[Andrew]]&lt;br /&gt;
* I&#039;ll do the Research problem and contribution sections. - [[Andrew]]&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where to the do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
====memchached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table. memchached is very much not parallel, but can be made to be, just run multiple instances. Have clients worry about synchronizing data between the different instances. With few requests memcached does most of its processing at the network stack, 80% of its time on one core.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads (Making it a perfect example of parallel programming). One thread to service incoming connections and various other threads to service those connections. On a single core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
 Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
 Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
   - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counter by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better then your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
 Fairness criterion:&lt;br /&gt;
 - does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)&lt;br /&gt;
 - does the test put all tested implementations through the same test? (internal fairness)&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, therefore neither of them has a particular advantage. That holds true for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====PostgreSQL: &#039;&#039;Section 5.5&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Psearchy: &#039;&#039;Section 5.7&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Metis: &#039;&#039;Section 5.8&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5997</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5997"/>
		<updated>2010-12-01T22:27:52Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 630pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont [mailto:dan.sont@gmail.com dan.sont@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unessesary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* Also, I will take sloppy counters as well - [[Daniel B.]] &lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by - Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where to the do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel. The server forks a new process for each connection and twice to deliver each message. It spends 69% of its time in the kernel on a single core.&lt;br /&gt;
&lt;br /&gt;
====memchached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table. memchached is very much not parallel, but can be made to be, just run multiple instances. Have clients worry about synchronizing data between the different instances. With few requests memcached does most of its processing at the network stack, 80% of its time on one core.&lt;br /&gt;
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
Apache is a web server. In the case of this study, Apache has been configured to run a separate process on each core. Each process, in turn, has multiple threads. Making it a perfect example of parallel programming. One thread to service incoming connections and various other threads to service those connections. On a single core processor, Apache spends 60% of its execution time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
As implied by the name PostgreSQL is a SQL database. PostgreSQL starts a separate process for each connection and uses kernel locking interfaces  extensively to provide concurrent access to the database. Due to bottlenecks introduced in its code and in the kernel code, the amount of time PostgreSQL spends in the kernel increases very rapidly with addition of new cores. On a single core system PostgreSQL spends only 1.5% of its time in the kernel. On a 48 core system the execution time in the kernel jumps to 82%.&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
 Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
 Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
   - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critic of the paper.&lt;br /&gt;
&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
Bottlenecks were encountered when the applications undergoing testing were referencing and updating shared counters for multiple cores. The solution in the paper is to use sloppy counters, which gets each core to track its own separate counts of references and uses a central shared counter to keep all counts on track. This is ideal because each core updates its counter by modifying its per-core counter, usually only needing access to its own local cache, cutting down on waiting for locks or serialization. Sloppy counters are also backwards-compatible with existing shared-counter code, making its implementation much easier to accomplish. The main disadvantages of the sloppy counters are that in situations where object de-allocation occurs often, because the de-allocation itself is an expensive operation, and the counters use up space proportional to the number of cores.&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks - a per-superblock list of open files, vfsmount table, the packet buffers free list. Each data structure was decentralized to per-core versions of itself. In the case of vfsmount the central data structure was maintained, and any per-core misses got written from the central table to the per-core table.&lt;br /&gt;
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Misplaced variables on the cache cause different cores to request the same line to be read and written at the same time often enough to significantly impact performance. By moving the often written variable to another line the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t need to lock. Likewise mutexes can be split from locking the whole data structure to locking a part of it. Both these changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better then your implementation&amp;quot; paper the &amp;quot;goodness&amp;quot; of content can be impartially determined by its fairness and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
 Fairness criterion:&lt;br /&gt;
 - does the test accurately describe real-world use-cases (or some set there-of)? (external fairness, maybe ignored for testing and benchmarking purposes, usually is too)&lt;br /&gt;
 - does the test put all tested implementations through the same test? (internal fairness)&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, therefore neither of them has a particular advantage. That holds true for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to be ignoring the real-world constraint of the IO bottleneck, thus is unfair when compared to real-world scenarios. The purpose was not to test the IO bottleneck. Therefore the unfairness to real-world scenarios is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====PostgreSQL: &#039;&#039;Section 5.5&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Psearchy: &#039;&#039;Section 5.7&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Metis: &#039;&#039;Section 5.8&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style Criterion (feel free to add I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5803</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5803"/>
		<updated>2010-11-30T21:11:58Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Claim Sections */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Class and Notices=&lt;br /&gt;
(Nov. 30, 2010) Prof. Anil stated that we should focus on the 3 easiest to understand parts in section 5 and elaborate on them.&lt;br /&gt;
&lt;br /&gt;
- Also, I, Daniel B., work Thursday night, so I will be finishing up as much of my part as I can for the essay before Thursday morning&#039;s class, then maybe we can all meet up in a lab in HP and put the finishing touches on the essay. I will be available online Wednesday night from about 630pm onwards and will be in the game dev lab or CCSS lounge Wednesday morning from about 11am to 2pm if anyone would like to meet up with me at those times.&lt;br /&gt;
&lt;br /&gt;
=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Rovic Perdon [mailto:rperdon@gmail.com rperdon@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Sont ?&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section &amp;amp; add links to supplementary info&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
* For starters I will take the Scalability Tutorial and gmake. Since the part for gmake is short in the paper, I will grab a few more sections later on. - [[Daniel B.]]&lt;br /&gt;
* I&#039;m gonna put some work into the apache and postgresql sections - kirill&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
This paper was authored by Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich.&lt;br /&gt;
&lt;br /&gt;
They all work at MIT CSAIL.&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf The paper: An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
 Ideas to explain:&lt;br /&gt;
 - thread (maybe)&lt;br /&gt;
 - Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
 - Summarize scalability tutorial (Section 4.1 of the paper) focus on what makes something (non-)scalable&lt;br /&gt;
 - Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where they do their processing)&lt;br /&gt;
&lt;br /&gt;
====Exim: &#039;&#039;Section 3.1&#039;&#039;====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice more to deliver each message. On a single core it spends 69% of its time in the kernel.&lt;br /&gt;
&lt;br /&gt;
====memchached: &#039;&#039;Section 3.2&#039;&#039;====&lt;br /&gt;
memcached is an in-memory hash table. memcached itself is very much not parallel, but it can be made parallel by running multiple instances and having clients worry about synchronizing data between the different instances. With few requests, memcached does most of its processing in the network stack - 80% of its time on one core.&lt;br /&gt;
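&lt;br /&gt;
One rough illustration of that client-side approach (ours, not the paper&#039;s setup; the instance count and the extra port numbers are made up - 11211 is just memcached&#039;s usual port): every client hashes a key the same way to decide which instance owns it, so each key lives in exactly one place.&lt;br /&gt;
 /* Sketch: a client picks one of several independent memcached instances by key. */&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define NINSTANCES 4    /* assumed: four instances, e.g. one per core */&lt;br /&gt;
 static const int instance_port[NINSTANCES] = { 11211, 11212, 11213, 11214 };&lt;br /&gt;
 /* simple djb2-style string hash */&lt;br /&gt;
 static unsigned long hash_key(const char *key)&lt;br /&gt;
 {&lt;br /&gt;
     unsigned long h = 5381;&lt;br /&gt;
     while (*key)&lt;br /&gt;
         h = h * 33 + (unsigned char)*key++;&lt;br /&gt;
     return h;&lt;br /&gt;
 }&lt;br /&gt;
 /* every client applies the same rule, so a given key always maps to one instance */&lt;br /&gt;
 static int port_for_key(const char *key)&lt;br /&gt;
 {&lt;br /&gt;
     return instance_port[hash_key(key) % NINSTANCES];&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     printf(&amp;quot;key user:42 goes to the instance on port %d\n&amp;quot;, port_for_key(&amp;quot;user:42&amp;quot;));&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;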
&lt;br /&gt;
====Apache: &#039;&#039;Section 3.3&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====PostgreSQL: &#039;&#039;Section 3.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====gmake: &#039;&#039;Section 3.5&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Psearchy: &#039;&#039;Section 3.6&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Metis: &#039;&#039;Section 3.7&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
 Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
 Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
 Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
   - So long as we cite the paper and don&#039;t pretend the graphs are ours, we are ok, since we are writing an explanation/critique of the paper.&lt;br /&gt;
&lt;br /&gt;
 Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
====Multicore packet processing: &#039;&#039;Section 4.2&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Sloppy counters: &#039;&#039;Section 4.3&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Lock-free comparison: &#039;&#039;Section 4.4&#039;&#039;====&lt;br /&gt;
&lt;br /&gt;
====Per-Core Data Structures: &#039;&#039;Section 4.5&#039;&#039;====&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each was decentralized into per-core versions of itself. In the case of vfsmount the central table was kept, and on a per-core miss the entry gets copied from the central table into the per-core table.&lt;br /&gt;
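&lt;br /&gt;
A toy sketch of the per-core idea (our own illustration, not the kernel&#039;s actual code): each core allocates from its own free list, so cores stop contending on one shared, locked list.&lt;br /&gt;
 /* Sketch: per-core free lists instead of one global free list. */&lt;br /&gt;
 #include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
 #define NCORES 48              /* the paper&#039;s test machine had 48 cores */&lt;br /&gt;
 struct buf {&lt;br /&gt;
     struct buf *next;&lt;br /&gt;
     char data[2048];&lt;br /&gt;
 };&lt;br /&gt;
 /* one list head per core, padded so two heads never share a cache line */&lt;br /&gt;
 struct percore_list {&lt;br /&gt;
     struct buf *head;&lt;br /&gt;
     char pad[64 - sizeof(struct buf *)];&lt;br /&gt;
 };&lt;br /&gt;
 static struct percore_list freelist[NCORES];&lt;br /&gt;
 /* caller is assumed to be pinned to this core (or to have preemption off) */&lt;br /&gt;
 static struct buf *buf_alloc(int core)&lt;br /&gt;
 {&lt;br /&gt;
     struct buf *b = freelist[core].head;&lt;br /&gt;
     if (b != NULL)&lt;br /&gt;
         freelist[core].head = b-&amp;gt;next;&lt;br /&gt;
     return b;                  /* NULL: refill from a rarely touched global pool */&lt;br /&gt;
 }&lt;br /&gt;
 static void buf_free(int core, struct buf *b)&lt;br /&gt;
 {&lt;br /&gt;
     b-&amp;gt;next = freelist[core].head;&lt;br /&gt;
     freelist[core].head = b;&lt;br /&gt;
 }&lt;br /&gt;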
&lt;br /&gt;
====Eliminating false sharing: &#039;&#039;Section 4.6&#039;&#039;====&lt;br /&gt;
Badly placed variables that share a cache line cause different cores to read and write that same line at the same time, often enough to significantly impact performance. By moving the frequently written variable onto its own cache line, the bottleneck was removed.&lt;br /&gt;
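&lt;br /&gt;
A minimal illustration of the fix (ours, not the paper&#039;s actual patch; 64 bytes is an assumed cache line size and the alignment attribute is GCC syntax): force the frequently written field onto its own cache line so it stops dragging the read-mostly field&#039;s line back and forth between cores.&lt;br /&gt;
 /* Before: both fields share one cache line, so writers and readers fight over it. */&lt;br /&gt;
 struct stats_shared_line {&lt;br /&gt;
     long hot_counter;        /* written constantly by one core */&lt;br /&gt;
     long mostly_read_flag;   /* read by every core */&lt;br /&gt;
 };&lt;br /&gt;
 /* After: each field is forced onto its own 64-byte cache line. */&lt;br /&gt;
 struct stats_separate_lines {&lt;br /&gt;
     long hot_counter __attribute__((aligned(64)));&lt;br /&gt;
     long mostly_read_flag __attribute__((aligned(64)));&lt;br /&gt;
 };&lt;br /&gt;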
&lt;br /&gt;
====Avoiding unnecessary locking: &#039;&#039;Section 4.7&#039;&#039;====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t actually need to lock. Likewise, a single mutex protecting a whole data structure can be split into finer-grained locks that each protect only part of it. Both changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of its content can be judged impartially by the fairness of the tests and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
====Content(Fairness): &#039;&#039;Section 5&#039;&#039;====&lt;br /&gt;
 Fairness criteria:&lt;br /&gt;
 - does the test accurately reflect real-world use cases (or some set thereof)? (external fairness; this can be ignored for testing and benchmarking purposes, and usually is)&lt;br /&gt;
 - does the test put all tested implementations through the same workload? (internal fairness)&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, so neither of them has a particular advantage. That holds true for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to ignore the real-world constraint of the IO bottleneck, and is therefore unfair when compared to real-world scenarios. Since the purpose was not to test the IO bottleneck, however, that unfairness is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
=====Apache: &#039;&#039;Section 5.4&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====PostgreSQL: &#039;&#039;Section 5.5&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====gmake: &#039;&#039;Section 5.6&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Psearchy: &#039;&#039;Section 5.7&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
=====Metis: &#039;&#039;Section 5.8&#039;&#039;=====&lt;br /&gt;
&lt;br /&gt;
====Style====&lt;br /&gt;
 Style criteria (feel free to add; I have no idea what should go here):&lt;br /&gt;
 - does the paper present information out of order?&lt;br /&gt;
 - does the paper present needless information?&lt;br /&gt;
 - does the paper have any sections that are inherently confusing?&lt;br /&gt;
 - is the paper easy to read through, or does it change subjects repeatedly?&lt;br /&gt;
 - does the paper have too many &amp;quot;long-winded&amp;quot; sentences, making it seem like the authors are just trying to add extra words to make it seem more important? - I think maybe limit this to run-on sentences.&lt;br /&gt;
 - Check for grammar&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5400</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_1&amp;diff=5400"/>
		<updated>2010-11-22T22:32:53Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
Patrick Young [mailto:Rannath@gmail.com Rannath@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Daniel Beimers [mailto:demongyro@gmail.com demongyro@gmail.com]&lt;br /&gt;
&lt;br /&gt;
Andrew Bown [mailto:abown2@connect.carleton.ca abown2@connect.carleton.ca]&lt;br /&gt;
&lt;br /&gt;
Kirill Kashigin [mailto:k.kashigin@gmail.com k.kashigin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
=Methodology=&lt;br /&gt;
We should probably have our work verified by at least one group member before posting to the actual page&lt;br /&gt;
&lt;br /&gt;
=To Do=&lt;br /&gt;
*Improve the grammar/structure of the paper section&lt;br /&gt;
*Background Concepts -fill in info (fii)&lt;br /&gt;
*Research problem -fii&lt;br /&gt;
*Contribution -fii&lt;br /&gt;
*Critique -fii&lt;br /&gt;
*References -fii&lt;br /&gt;
&lt;br /&gt;
===Claim Sections===&lt;br /&gt;
* I claim Exim and memcached for background and critique -[[Rannath]]&lt;br /&gt;
* also per-core data structures, false sharing and unnecessary locking for contribution -[[Rannath]]&lt;br /&gt;
&lt;br /&gt;
=Essay=&lt;br /&gt;
===Paper===&lt;br /&gt;
 The paper&#039;s title, authors, and their affiliations. Include a link to the paper and any particularly helpful supplementary information.&lt;br /&gt;
&lt;br /&gt;
Authors in order presented: Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich&lt;br /&gt;
&lt;br /&gt;
affiliation: MIT CSAIL&lt;br /&gt;
&lt;br /&gt;
[http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf An Analysis of Linux Scalability to Many Cores]&lt;br /&gt;
&lt;br /&gt;
===Background Concepts===&lt;br /&gt;
 Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.&lt;br /&gt;
&lt;br /&gt;
Ideas to explain:&lt;br /&gt;
#thread (maybe)&lt;br /&gt;
#Linux&#039;s move towards scalability precedes this paper. (assert this, no explanation needed, maybe a few examples)&lt;br /&gt;
#Summarize scalability tutorial (Section 4.1 of the paper)&lt;br /&gt;
#Describe the programs tested (what they do, how they&#039;re programmed (serial vs parallel), where they do their processing)&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 3.1&#039;&#039;=====&lt;br /&gt;
Exim is a mail server for Unix. It&#039;s fairly parallel: the server forks a new process for each connection, and forks twice more to deliver each message. On a single core it spends 69% of its time in the kernel.&lt;br /&gt;
&lt;br /&gt;
=====memchached: &#039;&#039;Section 3.2&#039;&#039;=====&lt;br /&gt;
memcached is an in-memory hash table. memcached itself is very much not parallel, but it can be made parallel by running multiple instances and having clients worry about synchronizing data between the different instances. With few requests, memcached does most of its processing in the network stack - 80% of its time on one core.&lt;br /&gt;
&lt;br /&gt;
===Research problem===&lt;br /&gt;
 What is the research problem being addressed by the paper? How does this problem relate to past related work?&lt;br /&gt;
&lt;br /&gt;
Problem being addressed: scalability of current generation OS architecture, using Linux as an example. (?)&lt;br /&gt;
&lt;br /&gt;
Summarize related works (Section 2, include links, expand information to have at least a summary of some related work)&lt;br /&gt;
&lt;br /&gt;
===Contribution===&lt;br /&gt;
 What was implemented? Why is it any better than what came before?&lt;br /&gt;
&lt;br /&gt;
Summarize info from Section 4.2 onwards, maybe put graphs from Section 5 here to provide support for improvements (if that isn&#039;t unethical/illegal)?&lt;br /&gt;
&lt;br /&gt;
Conclusion: we can make a traditional OS architecture scale (at least to 48 cores), we just have to remove bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=====Per-Core Data Structures=====&lt;br /&gt;
Three centralized data structures were causing bottlenecks: the per-superblock list of open files, the vfsmount table, and the packet buffer free list. Each was decentralized into per-core versions of itself. In the case of vfsmount the central table was kept, and on a per-core miss the entry gets copied from the central table into the per-core table.&lt;br /&gt;
&lt;br /&gt;
=====Eliminating false sharing=====&lt;br /&gt;
Badly placed variables that share a cache line cause different cores to read and write that same line at the same time, often enough to significantly impact performance. By moving the frequently written variable onto its own cache line, the bottleneck was removed.&lt;br /&gt;
&lt;br /&gt;
=====Avoiding unnecessary locking=====&lt;br /&gt;
Many locks/mutexes have special cases where they don&#039;t actually need to lock. Likewise, a single mutex protecting a whole data structure can be split into finer-grained locks that each protect only part of it. Both changes remove or reduce bottlenecks.&lt;br /&gt;
&lt;br /&gt;
===Critique===&lt;br /&gt;
 What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad are not enough - you must explain why.&lt;br /&gt;
&lt;br /&gt;
Since this is a &amp;quot;my implementation is better than your implementation&amp;quot; paper, the &amp;quot;goodness&amp;quot; of its content can be judged impartially by the fairness of the tests and the honesty of the authors.&lt;br /&gt;
&lt;br /&gt;
Fairness criteria:&lt;br /&gt;
#does the test accurately reflect real-world use cases (or some set thereof)? (external fairness; this can be ignored for testing and benchmarking purposes, and usually is)&lt;br /&gt;
#does the test put all tested implementations through the same workload? (internal fairness)&lt;br /&gt;
&lt;br /&gt;
Style criteria (feel free to add; I have no idea what should go here):&lt;br /&gt;
#does the paper present information out of order?&lt;br /&gt;
#does the paper present needless information?&lt;br /&gt;
#does the paper have any sections that are inherently confusing?&lt;br /&gt;
&lt;br /&gt;
=====Testing Method: &#039;&#039;Section 5&#039;&#039;=====&lt;br /&gt;
Both the stock and new implementations use the same benchmarks, so internal fairness is preserved for all seven programs.&lt;br /&gt;
&lt;br /&gt;
=====Exim: &#039;&#039;Section 5.2&#039;&#039;=====&lt;br /&gt;
The test uses a relatively small number of connections, but that is also implicitly stated to be a non-issue - &amp;quot;as long as there are enough clients to keep Exim busy, the number of clients has little effect on performance.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
This test is explicitly stated to ignore the real-world constraint of the IO bottleneck, and is therefore unfair when compared to real-world scenarios. Since the purpose was not to test the IO bottleneck, however, that unfairness is unimportant.&lt;br /&gt;
&lt;br /&gt;
=====memcached: &#039;&#039;Section 5.3&#039;&#039;=====&lt;br /&gt;
memcached has no explicit or implicit fairness concerns with respect to real-world scenarios.&lt;br /&gt;
&lt;br /&gt;
===References===&lt;br /&gt;
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3610</id>
		<title>COMP 3000 Essay 1 2010 Question 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3610"/>
		<updated>2010-10-14T04:31:07Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Traditionally Optimized File Systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How do the constraints of flash storage affect the design of flash-optimized file systems? Explain by contrasting with hard disk-based file systems.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&#039;&#039;an introduction goes here&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Flash Memory==&lt;br /&gt;
&#039;&#039;basic info on flash memory&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Hard Disk drives==&lt;br /&gt;
&#039;&#039;basic info on HDDs&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Traditionally Optimized File Systems==&lt;br /&gt;
&#039;&#039;things traditional files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
	Most conventional file systems are designed to be implemented on hard disk drives. That does not mean they cannot be implemented on a solid state drive (file storage that uses flash memory instead of magnetic discs), but doing so would, in many ways, defeat the purpose of using flash memory. The most time-consuming operation for an HDD is seeking data by relocating the read head and spinning the magnetic disk, so a traditional file system optimizes the way it stores data by placing related blocks close together on the disk to minimize mechanical movement. One of the great advantages of flash memory, and the reason for its fast read speed, is that there is no need to seek data physically, so there is no need to waste resources laying the data out in close proximity.&lt;br /&gt;
	A traditional HDD file system will also attempt to defragment itself, moving blocks of data around so they sit closer together on the magnetic disk. This process, although beneficial for HDDs, is harmful and inefficient for flash-based storage. A flash-optimized file system needs to reduce the number of erase operations, since flash memory has only a limited number of erase cycles as well as very slow erase speeds.&lt;br /&gt;
	When an HDD rewrites data to a physical location there is no need for it to erase the previously occupying data first, so a traditional disk-based file system doesn&#039;t worry about erasing data from unused memory blocks. In contrast, flash memory needs to erase a block before it can modify any of its contents. Since the erase procedure is extremely slow, it&#039;s not practical to overwrite old data every time; doing so is also detrimental to the life span of the flash memory.&lt;br /&gt;
	To maximize the potential of flash-based memory, the file system should instead write new data to empty memory blocks. This method also calls for some sort of garbage collection to erase unused blocks when the system is idle, something conventional file systems do not implement since they do not need it.&lt;br /&gt;
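&lt;br /&gt;
A toy sketch of that contrast (purely illustrative; the block and page geometry is made up): updating a page in place forces a slow, wearing block erase right away, while writing the new version to an empty page leaves the erase to an idle-time garbage collector.&lt;br /&gt;
 /* Sketch: in-place update (erase first) vs. out-of-place update on flash. */&lt;br /&gt;
 #include &amp;lt;string.h&amp;gt;&lt;br /&gt;
 #define PAGE_SIZE        2048&lt;br /&gt;
 #define PAGES_PER_BLOCK  64&lt;br /&gt;
 struct flash_block {&lt;br /&gt;
     unsigned char page[PAGES_PER_BLOCK][PAGE_SIZE];&lt;br /&gt;
     int used[PAGES_PER_BLOCK];   /* 1 = page holds live data */&lt;br /&gt;
     int erases;                  /* wear: every erase counts against the block */&lt;br /&gt;
 };&lt;br /&gt;
 static void block_erase(struct flash_block *b)&lt;br /&gt;
 {&lt;br /&gt;
     memset(b-&amp;gt;page, 0xff, sizeof(b-&amp;gt;page));   /* slow on real hardware */&lt;br /&gt;
     memset(b-&amp;gt;used, 0, sizeof(b-&amp;gt;used));&lt;br /&gt;
     b-&amp;gt;erases++;&lt;br /&gt;
 }&lt;br /&gt;
 /* Disk-style thinking: changing one page in place means erasing the whole&lt;br /&gt;
    block first (the save/restore of the other live pages is omitted here). */&lt;br /&gt;
 static void update_in_place(struct flash_block *b, int p, const void *data)&lt;br /&gt;
 {&lt;br /&gt;
     block_erase(b);                          /* the expensive, wearing step */&lt;br /&gt;
     memcpy(b-&amp;gt;page[p], data, PAGE_SIZE);&lt;br /&gt;
     b-&amp;gt;used[p] = 1;&lt;br /&gt;
 }&lt;br /&gt;
 /* Flash-friendly: write the new copy to any empty page and mark the old one&lt;br /&gt;
    as garbage; an idle-time collector erases mostly-garbage blocks later. */&lt;br /&gt;
 static int update_out_of_place(struct flash_block *b, int oldp, const void *data)&lt;br /&gt;
 {&lt;br /&gt;
     for (int p = 0; p &amp;lt; PAGES_PER_BLOCK; p++) {&lt;br /&gt;
         if (!b-&amp;gt;used[p]) {&lt;br /&gt;
             memcpy(b-&amp;gt;page[p], data, PAGE_SIZE);&lt;br /&gt;
             b-&amp;gt;used[p] = 1;&lt;br /&gt;
             b-&amp;gt;used[oldp] = 0;       /* old copy becomes garbage */&lt;br /&gt;
             return p;&lt;br /&gt;
         }&lt;br /&gt;
     }&lt;br /&gt;
     return -1;                           /* block full: garbage collect first */&lt;br /&gt;
 }&lt;br /&gt;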
&lt;br /&gt;
==Flash Optimized File Systems==&lt;br /&gt;
&#039;&#039;Flash Optimized files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Similarities==&lt;br /&gt;
&#039;&#039;this could probably be titled better&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Differences==&lt;br /&gt;
&#039;&#039;ditto for this one&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
=External links=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey guys,&lt;br /&gt;
&lt;br /&gt;
This is what I&#039;ve got so far... mostly based on wikipedia:&lt;br /&gt;
&lt;br /&gt;
Flash memory has two limitations: it can only be erased in blocks, and it wears out after a certain number of erase cycles. Furthermore, a particular kind of flash memory (NAND) is not able to provide random access. &lt;br /&gt;
As a result of these, flash-based file systems cannot be handled in the same way as disk-based file systems. Here are a few of the key differences:&lt;br /&gt;
&lt;br /&gt;
-	Because memory must be erased in blocks, its erasure tends to take time. Consequently, it is necessary to schedule the erasures so as not to interfere with the efficiency of the system’s other operations. This is not a real concern with disk-based file-systems. &lt;br /&gt;
-	A disk file-system needs to minimize seek time, but a flash file-system does not concern itself with this as it doesn’t have a disk. &lt;br /&gt;
-	A flash system tries to distribute writes in such a way that no particular block of memory is subject to a disproportionately large number of erasures; the purpose is to keep any one block from wearing out prematurely (a small sketch of this idea follows below). The result is that memory needs to be allocated differently than in a disk-based file-system. &lt;br /&gt;
Log-structured file systems are thus best suited to dealing with flash memory (they apparently do all of the above things). &lt;br /&gt;
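&lt;br /&gt;
A tiny sketch of the wear-leveling point above (our own illustration, not any particular file system; the device size is made up): when new space is needed, reuse the free block that has been erased the fewest times, so no single block wears out early.&lt;br /&gt;
 /* Sketch: naive wear leveling - always reuse the least-erased free block. */&lt;br /&gt;
 #define NBLOCKS 1024&lt;br /&gt;
 struct blk {&lt;br /&gt;
     int free;           /* 1 if the block is available for new data */&lt;br /&gt;
     int erase_count;    /* how many times it has been erased so far */&lt;br /&gt;
 };&lt;br /&gt;
 static struct blk dev[NBLOCKS];&lt;br /&gt;
 static int pick_block(void)&lt;br /&gt;
 {&lt;br /&gt;
     int best = -1;&lt;br /&gt;
     for (int i = 0; i &amp;lt; NBLOCKS; i++) {&lt;br /&gt;
         if (!dev[i].free)&lt;br /&gt;
             continue;&lt;br /&gt;
         if (best == -1 || dev[i].erase_count &amp;lt; dev[best].erase_count)&lt;br /&gt;
             best = i;&lt;br /&gt;
     }&lt;br /&gt;
     return best;        /* -1: nothing free, garbage collect first */&lt;br /&gt;
 }&lt;br /&gt;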
&lt;br /&gt;
For the essay form, I&#039;m thinking of doing a section about traditional hard-disk systems, another about flash-memory and a third about flash systems. At this point, I am imagining the thesis as something like, &amp;quot;Flash systems require a fundamentally different system architecture than disk-based systems due to their need to adapt to the constraints inherent in flash memory: specifically, due to that memory&#039;s limited life-span and block-based erasures.&amp;quot; The argument would then talk about how these two differences directly lead to a new FS approach. &lt;br /&gt;
&lt;br /&gt;
That&#039;s how I see it at the moment. Honestly, I don&#039;t like doing research about this kind of stuff, so my data isn&#039;t very deep. That said, if you guys could find more info and summarize it, I&#039;m pretty sure that I could synthesize it all into a coherent essay. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3609</id>
		<title>COMP 3000 Essay 1 2010 Question 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3609"/>
		<updated>2010-10-14T04:30:48Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Traditionally Optimized File Systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How do the constraints of flash storage affect the design of flash-optimized file systems? Explain by contrasting with hard disk-based file systems.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&#039;&#039;an introduction goes here&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Flash Memory==&lt;br /&gt;
&#039;&#039;basic info on flash memory&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Hard Disk drives==&lt;br /&gt;
&#039;&#039;basic info on HDDs&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Traditionally Optimized File Systems==&lt;br /&gt;
&#039;&#039;things traditional files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&amp;lt;nowiki&amp;gt;&lt;br /&gt;
	Most conventional file systems are designed to be implemented on hard disk drives. That does not mean they cannot be implemented on a solid state drive (file storage that uses flash memory instead of magnetic discs), but doing so would, in many ways, defeat the purpose of using flash memory. The most time-consuming operation for an HDD is seeking data by relocating the read head and spinning the magnetic disk, so a traditional file system optimizes the way it stores data by placing related blocks close together on the disk to minimize mechanical movement. One of the great advantages of flash memory, and the reason for its fast read speed, is that there is no need to seek data physically, so there is no need to waste resources laying the data out in close proximity.&lt;br /&gt;
	A traditional HDD file system will also attempt to defragment itself, moving blocks of data around so they sit closer together on the magnetic disk. This process, although beneficial for HDDs, is harmful and inefficient for flash-based storage. A flash-optimized file system needs to reduce the number of erase operations, since flash memory has only a limited number of erase cycles as well as very slow erase speeds.&lt;br /&gt;
	When an HDD rewrites data to a physical location there is no need for it to erase the previously occupying data first, so a traditional disk-based file system doesn&#039;t worry about erasing data from unused memory blocks. In contrast, flash memory needs to erase a block before it can modify any of its contents. Since the erase procedure is extremely slow, it&#039;s not practical to overwrite old data every time; doing so is also detrimental to the life span of the flash memory.&lt;br /&gt;
	To maximize the potential of flash-based memory, the file system should instead write new data to empty memory blocks. This method also calls for some sort of garbage collection to erase unused blocks when the system is idle, something conventional file systems do not implement since they do not need it.&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Flash Optimized File Systems==&lt;br /&gt;
&#039;&#039;Flash Optimized files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Similarities==&lt;br /&gt;
&#039;&#039;this could probably be titled better&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Differences==&lt;br /&gt;
&#039;&#039;ditto for this one&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
=External links=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey guys,&lt;br /&gt;
&lt;br /&gt;
This is what I&#039;ve got so far... mostly based on wikipedia:&lt;br /&gt;
&lt;br /&gt;
Flash memory has two limitations: it can only be erased in blocks and and it wears out after a certain number of erase cycles. Furthermore, a particular kind of Flash memory (NAND) is not able to provide random access.  &lt;br /&gt;
As a result of these Flash based file-systems cannot be handled in the same way as disk-based file systems. Here are a few of the key differences:&lt;br /&gt;
&lt;br /&gt;
-	Because memory must be erased in blocks, its erasure tends to take up time. Consequently, it is necessary to time the erasures in a way so as not to interfere with the efficiency of the system’s other operations. This is is not a real concern with disk-based file-systems. &lt;br /&gt;
-	A disk file-system needs to minimize the seeking time, but Flash file-system does not concern itself with this as it doesn’t have a disk. &lt;br /&gt;
-	A flash system tries to distribute memory in such a way so as not to make a particular block of memory subject to a disproportionally large number of erasures. The purpose of this is to keep the block from wearing out prematurely. The result of it is that memory needs to be distributed differently than in a disk based file-system. &lt;br /&gt;
Log-sturctured file systems are thus best suited to dealing with flash memory (they apparently do all of the above things). &lt;br /&gt;
&lt;br /&gt;
For the essay form, I&#039;m thinking of doing a section about traditional hard-disk systems, another about flash-memory and a third about flash systems. At this point, I am imagining the thesis as something like, &amp;quot;Flash systems require a fundamentally different system architecture than disk-based systems due to their need to adapt to the constraints inherent in flash memory: specifically, due to that memory&#039;s limited life-span and block-based erasures.&amp;quot; The argument would then talk about how these two differences directly lead to a new FS approach. &lt;br /&gt;
&lt;br /&gt;
That&#039;s how I see it at the moment. Honestly, I don&#039;t like doing research about this kind of stuff, so my data isn&#039;t very deep. That said, if you guys could find more info and summarize it, I&#039;m pretty sure that I could synthesize it all into a coherent essay. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3608</id>
		<title>COMP 3000 Essay 1 2010 Question 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_10&amp;diff=3608"/>
		<updated>2010-10-14T04:30:08Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: /* Traditionally Optimized File Systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
How do the constraints of flash storage affect the design of flash-optimized file systems? Explain by contrasting with hard disk-based file systems.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&#039;&#039;an introduction goes here&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Flash Memory==&lt;br /&gt;
&#039;&#039;basic info on flash memory&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Hard Disk drives==&lt;br /&gt;
&#039;&#039;basic info on HDDs&#039;&#039;&lt;br /&gt;
===Constraints===&lt;br /&gt;
&#039;&#039;--&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Traditionally Optimized File Systems==&lt;br /&gt;
&#039;&#039;things traditional files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
	Most conventional file systems are designed to be implemented on hard disk drives. That does not mean they cannot be implemented on a solid state drive (file storage that uses flash memory instead of magnetic discs), but doing so would, in many ways, defeat the purpose of using flash memory. The most time-consuming operation for an HDD is seeking data by relocating the read head and spinning the magnetic disk, so a traditional file system optimizes the way it stores data by placing related blocks close together on the disk to minimize mechanical movement. One of the great advantages of flash memory, and the reason for its fast read speed, is that there is no need to seek data physically, so there is no need to waste resources laying the data out in close proximity.&lt;br /&gt;
	A traditional HDD file system will also attempt to defragment itself, moving blocks of data around so they sit closer together on the magnetic disk. This process, although beneficial for HDDs, is harmful and inefficient for flash-based storage. A flash-optimized file system needs to reduce the number of erase operations, since flash memory has only a limited number of erase cycles as well as very slow erase speeds.&lt;br /&gt;
	When an HDD rewrites data to a physical location there is no need for it to erase the previously occupying data first, so a traditional disk-based file system doesn&#039;t worry about erasing data from unused memory blocks. In contrast, flash memory needs to erase a block before it can modify any of its contents. Since the erase procedure is extremely slow, it&#039;s not practical to overwrite old data every time; doing so is also detrimental to the life span of the flash memory.&lt;br /&gt;
	To maximize the potential of flash-based memory, the file system should instead write new data to empty memory blocks. This method also calls for some sort of garbage collection to erase unused blocks when the system is idle, something conventional file systems do not implement since they do not need it.&lt;br /&gt;
&lt;br /&gt;
==Flash Optimized File Systems==&lt;br /&gt;
&#039;&#039;Flash Optimized files systems do to optimize HDD read/write/etc&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Similarities==&lt;br /&gt;
&#039;&#039;this could probably be titled better&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Differences==&lt;br /&gt;
&#039;&#039;ditto for this one&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
=External links=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey guys,&lt;br /&gt;
&lt;br /&gt;
This is what I&#039;ve got so far... mostly based on wikipedia:&lt;br /&gt;
&lt;br /&gt;
Flash memory has two limitations: it can only be erased in blocks and and it wears out after a certain number of erase cycles. Furthermore, a particular kind of Flash memory (NAND) is not able to provide random access.  &lt;br /&gt;
As a result of these Flash based file-systems cannot be handled in the same way as disk-based file systems. Here are a few of the key differences:&lt;br /&gt;
&lt;br /&gt;
-	Because memory must be erased in blocks, its erasure tends to take up time. Consequently, it is necessary to time the erasures in a way so as not to interfere with the efficiency of the system’s other operations. This is is not a real concern with disk-based file-systems. &lt;br /&gt;
-	A disk file-system needs to minimize the seeking time, but Flash file-system does not concern itself with this as it doesn’t have a disk. &lt;br /&gt;
-	A flash system tries to distribute memory in such a way so as not to make a particular block of memory subject to a disproportionally large number of erasures. The purpose of this is to keep the block from wearing out prematurely. The result of it is that memory needs to be distributed differently than in a disk based file-system. &lt;br /&gt;
Log-sturctured file systems are thus best suited to dealing with flash memory (they apparently do all of the above things). &lt;br /&gt;
&lt;br /&gt;
For the essay form, I&#039;m thinking of doing a section about traditional hard-disk systems, another about flash-memory and a third about flash systems. At this point, I am imagining the thesis as something like, &amp;quot;Flash systems require a fundamentally different system architecture than disk-based systems due to their need to adapt to the constraints inherent in flash memory: specifically, due to that memory&#039;s limited life-span and block-based erasures.&amp;quot; The argument would then talk about how these two differences directly lead to a new FS approach. &lt;br /&gt;
&lt;br /&gt;
That&#039;s how I see it at the moment. Honestly, I don&#039;t like doing research about this kind of stuff, so my data isn&#039;t very deep. That said, if you guys could find more info and summarize it, I&#039;m pretty sure that I could synthesize it all into a coherent essay. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_10&amp;diff=3606</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_10&amp;diff=3606"/>
		<updated>2010-10-14T04:28:25Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Hey all,&lt;br /&gt;
&lt;br /&gt;
I think we should write down our emails here so we can further discuss stuff without having to login here.&lt;br /&gt;
(&#039;&#039;&#039;***Note that discussions over email can&#039;t be counted towards your participation grade!***&#039;&#039;&#039;--[[User:Soma|Anil]])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Geoff Smith (gsmith0413@gmail.com) - gsmith6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Andrew Bujáki (abujaki [at] Connect or Live.ca)&lt;br /&gt;
***I&#039;m usually on MSN(Live) for collaboration at nights, Just make sure to put in a little message about who you are when you&#039;re adding me. :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I used Google Scholar and came to this page http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=812717&amp;amp;tag=1#&lt;br /&gt;
Which briefly touches on the issues of Flash memory. Specifically, inability to update in place, and limited write/erase cycles.&lt;br /&gt;
&lt;br /&gt;
Inability to update in place could refer to the way the flash disk is programmed, instead of bit-by-bit, it is programmed block-by-block. A block would have to be erased and completely reprogrammed in order to flip one bit after it&#039;s been set.&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_memory#Block_erasure&lt;br /&gt;
&lt;br /&gt;
Limited write/erase: Flash memory typically has a short lifespan if it&#039;s being used a lot. Writing and erasing the memory (Changing, updating, etc) Will wear it out. Flash memory has a finite amount of writes, (varying on manufacturer, models, etc), and once they&#039;ve been used up, you&#039;ll get bad sectors, corrupt data, and generally be SOL.&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_memory#Memory_wear&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Filesystems would have to be changed to play nicely with these constraints, where it must use blocks efficiently and nicely, and minimize writing/erasing as much as possible.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I found a paper that talks about the performance, capabilities and limitations of NAND flash storage. &lt;br /&gt;
&lt;br /&gt;
Abstract: &amp;quot;This presentation provides an in-depth examination of the&lt;br /&gt;
fundamental theoretical performance, capabilities, and&lt;br /&gt;
limitations of NAND Flash-based Solid State Storage (SSS). The&lt;br /&gt;
tutorial will explore the raw performance capabilities of NAND&lt;br /&gt;
Flash, and limitations to performance imposed by mitigation of&lt;br /&gt;
reliability issues, interfaces, protocols, and technology types.&lt;br /&gt;
Best practices for system integration of SSS will be discussed.&lt;br /&gt;
Performance achievements will be reviewed for various&lt;br /&gt;
products and applications. &amp;quot;&lt;br /&gt;
&lt;br /&gt;
Link: http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2009/20090812_T1B_Smith.pdf&lt;br /&gt;
&lt;br /&gt;
There&#039;s no Starting place like Wikipedia, even if you shouldn&#039;t source it. &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_Memory &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/LogFS &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Hard_disk &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Wear_leveling &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Hot_spot_%28computer_science%29&lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Solid-state_drive&lt;br /&gt;
&lt;br /&gt;
Hey Guys,&lt;br /&gt;
&lt;br /&gt;
We really don&#039;t have much time to get this done. Lets meet tomorrow after class and get our bearings to do this properly.&lt;br /&gt;
&lt;br /&gt;
Fedor&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A few of us have Networking immediately after class. I know personally I won&#039;t be able to make anything set on Tuesday.&lt;br /&gt;
Additionally, he spoke briefly about hotspots on the disk for our question last week, where places on the disk would be written to far more often than others. &lt;br /&gt;
As well, for bibliographical citing, http://bibme.org is a wonderful resource for the popular formats (I.e. MLA). If it should come down to that.&lt;br /&gt;
~Andrew&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===links===&lt;br /&gt;
&lt;br /&gt;
Start Posting some stuff to source from:&lt;br /&gt;
&lt;br /&gt;
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1199079&amp;amp;tag=1&lt;br /&gt;
--&amp;quot;Introduction to flash memory&amp;quot;&lt;br /&gt;
&lt;br /&gt;
http://portal.acm.org/citation.cfm?id=1244248&lt;br /&gt;
--&amp;quot;Wear Leveling&amp;quot; (it&#039;s about a proposed way of doing it, but explains a whole bunch of other things to do that)&lt;br /&gt;
&lt;br /&gt;
http://portal.acm.org/citation.cfm?id=1731355&lt;br /&gt;
--&amp;quot;Online maintenance of very large random samples on flash storage&amp;quot; (ie dealing with the constraints of Flash Storage in a system that might actually be written to 100000 times)&lt;br /&gt;
&lt;br /&gt;
http://vlsi.kaist.ac.kr/paper_list/2006_TC_CFFS.pdf&lt;br /&gt;
--&amp;quot;An Efficient NAND Flash File System for Flash Memory Storage&amp;quot; discusses shortcomings of using hard disk based file systems and current flash based file systems&lt;br /&gt;
&lt;br /&gt;
http://maltiel-consulting.com/NAND_vs_NOR_Flash_Memory_Technology_Overview_Read_Write_Erase_speed_for_SLC_MLC_semiconductor_consulting_expert.pdf&lt;br /&gt;
--&amp;quot;NAND vs NOR Flash Memory&amp;quot; (note: i didn&#039;t get this off of Google scholar but it seems to be written by someone from Toshiba. is that ok?)&lt;br /&gt;
&lt;br /&gt;
Hi everybody,&lt;br /&gt;
&lt;br /&gt;
So here are the latest news. Geoff, Andrew and myself had a meeting after class today and came up with a plan for writing this thing. &lt;br /&gt;
&lt;br /&gt;
We decided to have 3 parts:&lt;br /&gt;
&lt;br /&gt;
1. What flash storage is, why its good but also why it must have the problems that it does (the assumption is that it must have them, why would it otherwise?)&lt;br /&gt;
[don&#039;t know much about this just now... basics include that there is NOR (reads slightly faster)and NAND (holds more, writes faster, erases much faster, lasts about ten times longer) flash with NAND being especially popular for storage (what&#039;s NOR good for?). Here, we&#039;d ideally want to talk about why flash was invented (supposed as an alternative to slow ROM), why it was suitable for that, and how it works on a technical level. Then, we&#039;d want to mention why this technical functionality was innovative and useful but also why it came with two serious set-backs: having a limited-number of re-write cycles and needing to erase a block at a time.]&lt;br /&gt;
&lt;br /&gt;
Either way, Flash storage affords far faster fetch times than the traditional platter-based HDD, and stability of information in a sense. Where the data is not actually stored, but reprogrammed, in a sense, the data is more secure and is less likely to be erased easily. On that note, in order to flip a single bit, that entire block will need to be erased, then reprogrammed. In an &#039;old&#039; HDD, let&#039;s say, When the HDD fails at the end of its life cycle, your data is gone. (unless you&#039;re willing to shell out $200/hr to have it recovered, yes I&#039;ve seen companies in Ottawa that do this.) In a flash HDD, when it reaches the end of its life, it merely becomes read-only. Bugger for Databases, but useful for technical notes and archives, let&#039;s say.&lt;br /&gt;
With today&#039;s modern gaming computers, Flash memory can be good on quick load times, however with limited read-writes, it could afford better use to things that are not updated as frequently. I.e... Well I don&#039;t have a better example than a webserver hosting a company&#039;s CSS and scripts.&lt;br /&gt;
~Source: Years in the &#039;biz &lt;br /&gt;
&lt;br /&gt;
Flash memory started out as a replacement for EPROMs. At the time, EPROMs needed to be exposed to UV light to be erased, while flash memory could be erased electronically. The first flash memory product came out in 1988, but it did not take off until the late 1990s because it could not be reliably produced. NOR and NAND memory are named after the arrangement of the cells in the memory array. NOR-based flash memory benefits from very fast burst read times but slower write times. Due to the structure of NOR memory, programs stored in it can be executed without being loaded into RAM first. NAND flash memory has a very large storage capacity and can read and write large files relatively fast. NAND is more suited for storage, while NOR memory is better suited for direct program execution, such as in CMOS chips.&lt;br /&gt;
source: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1199079&amp;amp;tag=1 , http://maltiel-consulting.com/NAND_vs_NOR_Flash_Memory_Technology_Overview_Read_Write_Erase_speed_for_SLC_MLC_semiconductor_consulting_expert.pdf&lt;br /&gt;
&lt;br /&gt;
2. How a traditional disk-based file-system works and why the limitations of flash storage make the two a poor match&lt;br /&gt;
[the obvious answer seems to be that traditional file-systems could just write to whatever memory was available, but if they did this with a flash file-system, certain chunks of memory would become unusable before others and the memory would be more difficult to work with. Also, disk-based file systems need to deal with seek times, which means they want to organize their data in such a way as to reduce those (by putting related things together?) - with Flash, this isn&#039;t really a problem and thus one less constraint to be concerned with.]&lt;br /&gt;
&lt;br /&gt;
3. How a log based file-system works and why this method of operation is so well suited to working with flash memory especially in light of the latter&#039;s inherent limitations&lt;br /&gt;
[...]&lt;br /&gt;
&lt;br /&gt;
At this time, the plan is that Geoff will work on #3 today, Andrew will work on #1 tomorrow and I will work on #2 tomorrow. The three of us will make an effort to consult some somewhat more painfully technical literature in order to gain insight into our respective queries. Whatever insight we find will be posted here. &lt;br /&gt;
&lt;br /&gt;
Then, we will meet again on Thursday after class to decide how to actually write the essay.&lt;br /&gt;
&lt;br /&gt;
PS, if there is anybody in the group besides the three of us - let us know so you can find a way to contribute to this... as at least two of us are competent essayists, painfully technical research on one or more of the above topics would be a great way to contribute... especially if you could post it here prior to one of us going over the same thing. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;br /&gt;
&lt;br /&gt;
-- I&#039;m not that great (but absolutely horrid) at essays and I&#039;m alright at research, but if nothing else I have Thursday off and nothing (else) that needs doing by Friday so I can probably spend a bunch of time working on it just before it&#039;s due. -- &#039;&#039;Nick L&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
-- Hay sorry I was unable to attend the meeting after class today. I am not too good at writing essays as well but I am pretty good at summarizing and researching. I am not too sure at what you would like me to do. Right now I&#039;ll assume you need me to research/summarizing articles for the 3 topics above. If you need me to do anything else post it here. I&#039;ll be checking the discussion regularly until this due. once again sorry for missing the meeting-- Paul Cox.&lt;br /&gt;
&lt;br /&gt;
-- Hey i&#039;m also supposed to be in on this. Sorry i couldn&#039;t contribute sooner because i was playing catchup in my other classes. Let me know what i can do and i&#039;ll be on it asap. - kirill (k.kashigin@gmail.com)&lt;br /&gt;
update: i&#039;m gonna be helping Fedor with #2&lt;br /&gt;
&lt;br /&gt;
PS, this article http://docs.google.com/viewer?a=v&amp;amp;q=cache:E7-H_pv_18wJ:citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%3D10.1.1.92.2279%26rep%3Drep1%26type%3Dpdf+flash+memory+and+disk-based+file+systems&amp;amp;hl=en&amp;amp;gl=ca&amp;amp;pid=bl&amp;amp;srcid=ADGEESgspy-jqIdLOpaLYlPPoM56kjLPwXcL3_eMbTTBRkI7PG0jQKl9vIieTAYHubPu0EdQ0V4ccaf_p0S_SnqKMirSIM0Qoq5E0NpLd0M7LAGaE51wkD0F55cRSkX8dnTqx_9Yx2E7&amp;amp;sig=AHIEtbS-yfGI9Y48DJ0WyEEhmsXInelRGw looks really useful for part 3.&lt;br /&gt;
&lt;br /&gt;
PPS, and this article looks really great for understanding how log based file systems work: http://delivery.acm.org/10.1145/150000/146943/p26-rosenblum.pdf?key1=146943&amp;amp;key2=3656986821&amp;amp;coll=GUIDE&amp;amp;dl=GUIDE&amp;amp;CFID=108397378&amp;amp;CFTOKEN=72657973&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey Luc (TA) here, Anandtech ran a series of articles on solid state drives that you guys might find useful.  It mostly looked at hardware aspects but it gives some interesting insights on how to modify file systems to better support flash memory.&lt;br /&gt;
&lt;br /&gt;
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&lt;br /&gt;
&lt;br /&gt;
http://www.anandtech.com/storage/showdoc.aspx?i=3531&amp;amp;p=1&lt;br /&gt;
&lt;br /&gt;
http://anandtech.com/storage/showdoc.aspx?i=3631&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
--[[User:3maisons|3maisons]] 19:44, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey Paul&amp;amp;Kirill,&lt;br /&gt;
&lt;br /&gt;
If one of you guys could help me out with #2, that would be really great. I was going to work on that tomorrow, but I also have another large assignment to deal with and not having to do this research would greatly ease my life. Moreover, I do intend to work on writing&amp;amp;polishing the essay on Thursday as I have a lot of experience with that and it far more than research. Let me know if either one of you can help me with this. &lt;br /&gt;
&lt;br /&gt;
The other person could probably read over what Luc posted for us and see if it fits into our framework. Just be sure to state who is going to do what. &lt;br /&gt;
&lt;br /&gt;
Nick, &lt;br /&gt;
&lt;br /&gt;
Honestly, we really hope to have the research done by Thursday. If that is the only day that you are free and you&#039;re not a writer, I&#039;m honestly not sure what you could do. Perhaps someone else can think of something.&lt;br /&gt;
&lt;br /&gt;
- Fedor&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I&#039;m gonna have something for #2 up tonight. -kirill&lt;br /&gt;
&lt;br /&gt;
So I found this article on Reddit, posted from Linux Weekly News on pretty much exactly what we are looking at. It&#039;s entitled &amp;quot;Solid-state storage devices and the block layer&amp;quot;&lt;br /&gt;
&lt;br /&gt;
http://lwn.net/SubscriberLink/408428/68fa8465da45967a/    --[[User:Gsmith6|Gsmith6]] 20:36, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I wasn&#039;t exactly sure how much information i was supposed to present but here&#039;s what i got for #2:&lt;br /&gt;
&lt;br /&gt;
	Most conventional file systems are designed to be implemented on hard disk drives. That does not mean they cannot be implemented on a solid state drive (file storage that uses flash memory instead of magnetic discs), but doing so would, in many ways, defeat the purpose of using flash memory. The most time-consuming operation for an HDD is seeking data by relocating the read head and spinning the magnetic disk, so a traditional file system optimizes the way it stores data by placing related blocks close together on the disk to minimize mechanical movement. One of the great advantages of flash memory, and the reason for its fast read speed, is that there is no need to seek data physically, so there is no need to waste resources laying the data out in close proximity.&lt;br /&gt;
	A traditional HDD file system will also attempt to defragment itself, moving blocks of data around so they sit closer together on the magnetic disk. This process, although beneficial for HDDs, is harmful and inefficient for flash-based storage. A flash-optimized file system needs to reduce the number of erase operations, since flash memory has only a limited number of erase cycles as well as very slow erase speeds.&lt;br /&gt;
	When an HDD rewrites data to a physical location there is no need for it to erase the previously occupying data first, so a traditional disk-based file system doesn&#039;t worry about erasing data from unused memory blocks. In contrast, flash memory needs to erase a block before it can modify any of its contents. Since the erase procedure is extremely slow, it&#039;s not practical to overwrite old data every time; doing so is also detrimental to the life span of the flash memory.&lt;br /&gt;
	To maximize the potential of flash-based memory, the file system should instead write new data to empty memory blocks. This method also calls for some sort of garbage collection to erase unused blocks when the system is idle, something conventional file systems do not implement since they do not need it.&lt;br /&gt;
&lt;br /&gt;
--kirill&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_10&amp;diff=3262</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 10</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_10&amp;diff=3262"/>
		<updated>2010-10-13T15:25:06Z</updated>

		<summary type="html">&lt;p&gt;Kkashigi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Hey all,&lt;br /&gt;
&lt;br /&gt;
I think we should write down our emails here so we can further discuss stuff without having to login here.&lt;br /&gt;
(&#039;&#039;&#039;***Note that discussions over email can&#039;t be counted towards your participation grade!***&#039;&#039;&#039;--[[User:Soma|Anil]])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Geoff Smith (gsmith0413@gmail.com) - gsmith6&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Andrew Bujáki (abujaki [at] Connect or Live.ca)&lt;br /&gt;
***I&#039;m usually on MSN(Live) for collaboration at nights, Just make sure to put in a little message about who you are when you&#039;re adding me. :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I used Google Scholar and came to this page http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=812717&amp;amp;tag=1#&lt;br /&gt;
Which briefly touches on the issues of Flash memory. Specifically, inability to update in place, and limited write/erase cycles.&lt;br /&gt;
&lt;br /&gt;
Inability to update in place could refer to the way the flash disk is programmed, instead of bit-by-bit, it is programmed block-by-block. A block would have to be erased and completely reprogrammed in order to flip one bit after it&#039;s been set.&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_memory#Block_erasure&lt;br /&gt;
&lt;br /&gt;
Limited write/erase: Flash memory typically has a short lifespan if it&#039;s being used a lot. Writing and erasing the memory (Changing, updating, etc) Will wear it out. Flash memory has a finite amount of writes, (varying on manufacturer, models, etc), and once they&#039;ve been used up, you&#039;ll get bad sectors, corrupt data, and generally be SOL.&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_memory#Memory_wear&lt;br /&gt;
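One common mitigation is wear levelling (there&#039;s a Wikipedia link for it further down). A rough sketch of the policy, in Python and not tied to any particular controller, is just to program the free block that has been erased the fewest times:&lt;br /&gt;
&lt;pre&gt;
# Toy wear-levelling policy: spread erases evenly so no block wears out early.
def pick_block(free_blocks, erase_counts):
    return min(free_blocks, key=lambda b: erase_counts[b])

# Example: block 7 has seen the least wear, so it is written next.
print(pick_block([3, 7, 9], {3: 120, 7: 15, 9: 80}))   # prints 7
&lt;/pre&gt;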
&lt;br /&gt;
&lt;br /&gt;
File systems would have to be changed to play nicely with these constraints: using blocks efficiently and minimizing writing and erasing as much as possible.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I found a paper that talks about the performance, capabilities and limitations of NAND flash storage. &lt;br /&gt;
&lt;br /&gt;
Abstract: &amp;quot;This presentation provides an in-depth examination of the fundamental theoretical performance, capabilities, and limitations of NAND Flash-based Solid State Storage (SSS). The tutorial will explore the raw performance capabilities of NAND Flash, and limitations to performance imposed by mitigation of reliability issues, interfaces, protocols, and technology types. Best practices for system integration of SSS will be discussed. Performance achievements will be reviewed for various products and applications.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Link: http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2009/20090812_T1B_Smith.pdf&lt;br /&gt;
&lt;br /&gt;
There&#039;s no starting place like Wikipedia, even if you shouldn&#039;t cite it as a source. &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Flash_Memory &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/LogFS &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Hard_disk &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Wear_leveling &lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Hot_spot_%28computer_science%29&lt;br /&gt;
&lt;br /&gt;
http://en.wikipedia.org/wiki/Solid-state_drive&lt;br /&gt;
&lt;br /&gt;
Hey Guys,&lt;br /&gt;
&lt;br /&gt;
We really don&#039;t have much time to get this done. Let&#039;s meet tomorrow after class and get our bearings so we can do this properly.&lt;br /&gt;
&lt;br /&gt;
Fedor&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A few of us have Networking immediately after class. I know that I personally won&#039;t be able to make anything on Tuesday.&lt;br /&gt;
Additionally, he spoke briefly about hotspots on the disk for our question last week, i.e. places on the disk that get written to far more often than others. &lt;br /&gt;
Also, for bibliographic citations, http://bibme.org is a wonderful resource for the popular formats (e.g. MLA), if it should come down to that.&lt;br /&gt;
~Andrew&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===links===&lt;br /&gt;
&lt;br /&gt;
Start posting some stuff to source from:&lt;br /&gt;
&lt;br /&gt;
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1199079&amp;amp;tag=1&lt;br /&gt;
--&amp;quot;Introduction to flash memory&amp;quot;&lt;br /&gt;
&lt;br /&gt;
http://portal.acm.org/citation.cfm?id=1244248&lt;br /&gt;
--&amp;quot;Wear Leveling&amp;quot; (it&#039;s about one proposed way of doing it, but explains a bunch of other approaches along the way)&lt;br /&gt;
&lt;br /&gt;
http://portal.acm.org/citation.cfm?id=1731355&lt;br /&gt;
--&amp;quot;Online maintenance of very large random samples on flash storage&amp;quot; (i.e. dealing with the constraints of flash storage in a system that might actually be written to 100,000 times)&lt;br /&gt;
&lt;br /&gt;
http://vlsi.kaist.ac.kr/paper_list/2006_TC_CFFS.pdf&lt;br /&gt;
--&amp;quot;An Efficient NAND Flash File System for Flash Memory Storage&amp;quot; discusses the shortcomings of using hard disk based file systems and of current flash based file systems&lt;br /&gt;
&lt;br /&gt;
http://maltiel-consulting.com/NAND_vs_NOR_Flash_Memory_Technology_Overview_Read_Write_Erase_speed_for_SLC_MLC_semiconductor_consulting_expert.pdf&lt;br /&gt;
--&amp;quot;NAND vs NOR Flash Memory&amp;quot; (note: I didn&#039;t get this off Google Scholar, but it seems to be written by someone from Toshiba - is that OK?)&lt;br /&gt;
&lt;br /&gt;
Hi everybody,&lt;br /&gt;
&lt;br /&gt;
So here&#039;s the latest news: Geoff, Andrew and I had a meeting after class today and came up with a plan for writing this thing. &lt;br /&gt;
&lt;br /&gt;
We decided to have 3 parts:&lt;br /&gt;
&lt;br /&gt;
1. What flash storage is, why it&#039;s good, but also why it must have the problems that it does (the assumption is that it must have them; why would it otherwise?)&lt;br /&gt;
[I don&#039;t know much about this just yet... the basics are that there is NOR flash (reads slightly faster) and NAND flash (holds more, writes faster, erases much faster, lasts about ten times longer), with NAND being especially popular for storage (what&#039;s NOR good for?). Here, we&#039;d ideally want to talk about why flash was invented (intended as an alternative to slow ROM), why it was suitable for that, and how it works on a technical level. Then we&#039;d want to mention why this technical functionality was innovative and useful but also why it came with two serious setbacks: having a limited number of rewrite cycles and needing to erase a block at a time.]&lt;br /&gt;
&lt;br /&gt;
Either way, flash storage affords far faster fetch times than a traditional platter-based HDD, plus a kind of stability of information: because the data is reprogrammed rather than mechanically written, it is less likely to be erased by accident. On that note, in order to flip a single bit, the entire block needs to be erased and then reprogrammed. With an &#039;old&#039; HDD, when the drive fails at the end of its life cycle your data is gone (unless you&#039;re willing to shell out $200/hr to have it recovered - yes, I&#039;ve seen companies in Ottawa that do this). A flash drive, when it reaches the end of its life, merely becomes read-only. A bugger for databases, but useful for technical notes and archives, let&#039;s say.&lt;br /&gt;
With today&#039;s gaming computers, flash memory is good for quick load times; however, with its limited write/erase cycles, it is better suited to things that are not updated as frequently - e.g. a web server hosting a company&#039;s CSS and scripts.&lt;br /&gt;
~Source: Years in the &#039;biz &lt;br /&gt;
&lt;br /&gt;
Flash memory started out as a replacement for EPROMs. At the time, EPROMs needed exposure to UV light to be erased, while flash memory could be erased electronically. The first flash memory product came out in 1988, but it did not take off until the late 1990s because it could not be reliably produced. NOR and NAND memory are named after the arrangement of the cells in the memory array. NOR-based flash memory benefits from very fast burst read times but has slower write times; due to its structure, programs stored in NOR memory can be executed without being loaded into RAM first. NAND flash memory has a very large storage capacity and can read and write large files relatively fast. NAND is better suited to storage, while NOR is better suited to direct program execution, such as in CMOS chips.&lt;br /&gt;
source: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1199079&amp;amp;tag=1 , http://maltiel-consulting.com/NAND_vs_NOR_Flash_Memory_Technology_Overview_Read_Write_Erase_speed_for_SLC_MLC_semiconductor_consulting_expert.pdf&lt;br /&gt;
&lt;br /&gt;
2. How a traditional disk-based file-system works and why the limitations of flash storage make the two a poor match&lt;br /&gt;
[The obvious answer seems to be that traditional file systems can just keep writing to whatever blocks are convenient, but if they did this on flash, certain chunks of memory would wear out and become unusable before others, and the memory would become more difficult to work with. Also, disk-based file systems need to deal with seek times, which means they want to organize their data in such a way as to reduce them (by putting related things together?) - with flash this isn&#039;t really a problem, so that is one less constraint to be concerned with.]&lt;br /&gt;
&lt;br /&gt;
3. How a log-based file system works and why this method of operation is so well suited to working with flash memory, especially in light of the latter&#039;s inherent limitations (rough sketch of the append-only idea just below this list)&lt;br /&gt;
[...]&lt;br /&gt;
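For #3, here is a very rough sketch of the core log-structured idea (a generic illustration, not taken from the Rosenblum paper or any particular implementation): every update is appended at the head of a log instead of overwriting the old copy in place, and an index remembers where the newest version of each block lives - which lines up nicely with flash&#039;s preference for writing to fresh, already-erased blocks.&lt;br /&gt;
&lt;pre&gt;
# Toy log-structured store: append-only writes, index points at the newest copy.
class LogStore:
    def __init__(self):
        self.log = []       # the append-only log (on flash, a chain of blocks)
        self.index = {}     # block id mapped to the position of its newest version

    def write(self, block_id, data):
        self.log.append((block_id, data))          # never overwrite in place
        self.index[block_id] = len(self.log) - 1   # remember the newest copy

    def read(self, block_id):
        return self.log[self.index[block_id]][1]

# Stale copies left behind in the log become garbage and are reclaimed later
# by a cleaner, i.e. the same idle-time garbage collection idea as above.
&lt;/pre&gt;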
&lt;br /&gt;
At this time, the plan is that Geoff will work on #3 today, Andrew will work on #1 tomorrow, and I will work on #2 tomorrow. The three of us will make an effort to consult some of the more painfully technical literature in order to gain insight into our respective questions. Whatever insight we find will be posted here. &lt;br /&gt;
&lt;br /&gt;
Then, we will meet again on Thursday after class to decide how to actually write the essay.&lt;br /&gt;
&lt;br /&gt;
PS, if there is anybody in the group besides the three of us - let us know so you can find a way to contribute to this... since at least two of us are competent essayists, painfully technical research on one or more of the above topics would be a great way to contribute... especially if you could post it here before one of us goes over the same thing. &lt;br /&gt;
&lt;br /&gt;
Fedor&lt;br /&gt;
&lt;br /&gt;
-- I&#039;m not that great (but absolutely horrid) at essays and I&#039;m alright at research, but if nothing else I have Thursday off and nothing (else) that needs doing by Friday so I can probably spend a bunch of time working on it just before it&#039;s due. -- &#039;&#039;Nick L&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
-- Hey, sorry I was unable to attend the meeting after class today. I am not too good at writing essays either, but I am pretty good at summarizing and researching. I am not too sure what you would like me to do; for now I&#039;ll assume you need me to research/summarize articles for the 3 topics above. If you need me to do anything else, post it here. I&#039;ll be checking the discussion regularly until this is due. Once again, sorry for missing the meeting. -- Paul Cox&lt;br /&gt;
&lt;br /&gt;
-- Hey, I&#039;m also supposed to be in on this. Sorry I couldn&#039;t contribute sooner; I was playing catch-up in my other classes. Let me know what I can do and I&#039;ll be on it ASAP. - kirill (k.kashigin@gmail.com)&lt;br /&gt;
Update: I&#039;m going to be helping Fedor with #2&lt;br /&gt;
&lt;br /&gt;
PS, this article http://docs.google.com/viewer?a=v&amp;amp;q=cache:E7-H_pv_18wJ:citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%3D10.1.1.92.2279%26rep%3Drep1%26type%3Dpdf+flash+memory+and+disk-based+file+systems&amp;amp;hl=en&amp;amp;gl=ca&amp;amp;pid=bl&amp;amp;srcid=ADGEESgspy-jqIdLOpaLYlPPoM56kjLPwXcL3_eMbTTBRkI7PG0jQKl9vIieTAYHubPu0EdQ0V4ccaf_p0S_SnqKMirSIM0Qoq5E0NpLd0M7LAGaE51wkD0F55cRSkX8dnTqx_9Yx2E7&amp;amp;sig=AHIEtbS-yfGI9Y48DJ0WyEEhmsXInelRGw looks really useful for part 3.&lt;br /&gt;
&lt;br /&gt;
PPS, this article looks really great for understanding how log-based file systems work: http://delivery.acm.org/10.1145/150000/146943/p26-rosenblum.pdf?key1=146943&amp;amp;key2=3656986821&amp;amp;coll=GUIDE&amp;amp;dl=GUIDE&amp;amp;CFID=108397378&amp;amp;CFTOKEN=72657973&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey, Luc (TA) here. Anandtech ran a series of articles on solid state drives that you guys might find useful. The series mostly looks at hardware aspects, but it gives some interesting insights on how to modify file systems to better support flash memory.&lt;br /&gt;
&lt;br /&gt;
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&lt;br /&gt;
&lt;br /&gt;
http://www.anandtech.com/storage/showdoc.aspx?i=3531&amp;amp;p=1&lt;br /&gt;
&lt;br /&gt;
http://anandtech.com/storage/showdoc.aspx?i=3631&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
--[[User:3maisons|3maisons]] 19:44, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Hey Paul&amp;amp;Kirill,&lt;br /&gt;
&lt;br /&gt;
If one of you guys could help me out with #2, that would be really great. I was going to work on it tomorrow, but I also have another large assignment to deal with, and not having to do this research would greatly ease my life. Moreover, I do intend to work on writing and polishing the essay on Thursday, as I have a lot of experience with that - far more than with research. Let me know if either of you can help me with this. &lt;br /&gt;
&lt;br /&gt;
The other person could probably read over what Luc posted for us and see if it fits into our framework. Just be sure to state who is going to do what. &lt;br /&gt;
&lt;br /&gt;
Nick, &lt;br /&gt;
&lt;br /&gt;
Honestly, we really hope to have the research done by Thursday. If that is the only day that you are free and you&#039;re not a writer, I&#039;m honestly not sure what you could do. Perhaps someone else can think of something.&lt;br /&gt;
&lt;br /&gt;
- Fedor&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
I&#039;m gonna have something for #2 up tonight. -kirill&lt;/div&gt;</summary>
		<author><name>Kkashigi</name></author>
	</entry>
</feed>