COMP 3000 Essay 1 2010 Question 7
Question
How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?
Answer
A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require less system resources then concurrent cooperating processes and start much easier therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered more heavy and designs that involve them are not very scalable User threads on the other hand are mapped to kernel threads by the threads library such as libpthreads. and there are a few designs that incorporate it mainly Fibers and UMS (User Mode Scheudling) which allow for very high scalability. UMS threads have their own context and resources however the ability to switch in the user mode makes them more efficient (depending on application) than Thread Pools which are yet another mechanism that allows for high scalability.
Taken the liberty to add Praubic's tentative first para. No changes made as of yet.
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux up to a 64-core NUMA system Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them).
There can be some instances of misplaced information in the cache that can cause a "cache-coherency operation" to be called. This operation is comparatively expensive. Once misplaced information that causes this problem frequently is identified it can be moved to limit the problem.
There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Having locking writing was prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless-read.
The multiqueue scheduler is a third major bottle neck. Altogether the multiqueue scheduler ate up 25% of the CPU time. It had two problems. The spinlock ate up a fair majority of the CPU time. The rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler, which was then replaced by a more efficient scheduler [O(1) scheduler].
The next big bottle necks are related. They're both examples of course-granularity locks eating CPU time. Granularity refers to the execution time of a code segment. The closer a segment is to the speed of an atomic action the finer its granularity.
One big course-grained bottleneck in the system is the "Big Kernel Lock" (BKL) linux's kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The prefered method on Linux NUMA systems was to limit the BKL's usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottle necks.
The next course-grained bottleneck was the dcache_lock. It ate up a modest amount of time in normal use. But it was also called in the much more popular dnotify_parent() function, which made it non-acceptable. So the dcache_lock strategy was replaced with a strategy from a later implementation of linux. The later strategy was finer-grained.
Design Choices
(A) Kernel Threads and User Threads (1:1 vs M:N)
This is the most basic design choice. The 1:1 boasts of a slim clean library interface on top of the kernel functions. On the other hand the M:N would implement a complicated library though it would offer advantages in areas of signal handling. However a general consensus was that the M:N design was not compatible with the Linux kernel. Also the cost to implement the same would have been to high. Hence the 1:1 model was decided upon.
(B)Signal Handling
The kernel implements the POSIX signal handling. The kernel has to handle the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. Obviously this helps only if the 1-on-1 model is used.
(C)Synchronization
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application.Fortunately some new functionality was added to the kernel to implement all kinds of synchronization.
(D)Memory Allocation
One of the goals for the library is to have low startup costs for threads so that scalability is possible.The biggest problem time-wise outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is to be done by optimizing the memory allocation for the threads.
(E)Scheduling Priorities
<Paul:Please fill this section out. -Gautam>
Consider it done around 1 or 2pm Wednesday, Paul