COMP 3000 Essay 1 2010 Question 7

From Soma-notes

Question

How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?

Answer

A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require less system resources then concurrent cooperating processes and start much easier therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered more heavy and designs that involve them are not very scalable. User threads on the other hand, are mapped to kernel threads by the threads library such as libpthreads. There are a few designs that incorporate it, mainly Fibers and UMS (User Mode Scheduling) which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in the user mode makes them more efficient (depending on the application) than Thread Pools which are yet another mechanism that allows for high scalability.
Taken the liberty to add Praubic's tentative first para. No changes made as of yet.


One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):

There can be some instances of misplaced information in the cache that can cause a "cache-coherency operation" to be called. This operation is comparatively expensive. Once misplaced information that causes this problem is identified it can be moved to limit the problem.

There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Having locking reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless-read.

The multiqueue scheduler is the third major bottle neck. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems. The spinlock ate up a fair majority of the CPU time. Whilem, the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler,. The scheduler was then replaced by a more efficient scheduler [O(1) scheduler].

The next few bottle necks are related. They're both examples of course-granularity locks eating CPU time. Granularity refers to the execution time of a code segment. The closer a segment is to the speed of an atomic action the finer its granularity.

One big course-grained bottleneck in the system is the "Big Kernel Lock" (BKL) linux's kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL's usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottle necks.

The last course-grained bottleneck was the dcache_lock. It ate up a adequate amount of time in normal use. But it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of linux.


Design Choices

(A) Kernel Threads and User Threads (1:1 vs M:N)
This is the most basic design choice. The 1:1 boasts of a slim clean library interface on top of the kernel functions. Although, the M:N would implement a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to such a high cost for implementation. This gave birth to the 1:1 model.
(B)Signal Handling
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.
(C)Synchronization
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.
(D)Memory Management
One of the goals for the library is to have low startup costs for threads so that scalability is possible. The biggest problem, time-wise, outside the kernel is the memory needed for the thread data structures, thread-local storage, and the stack. This is corrected by optimizing the memory allocation for the threads. <Working on this section> hirving
(E)Scheduling Priorities
A thread is an entity that can be scheduled according to its scheduling priority which is a number ranging from 0 to 31 for Windows (and a Red-Black Tree used by the CFS (Completely Fair Scheduler) in Linux). All threads are executed in a time splice assigned to them in round robin fashion and lower priority threads wait until the ones above finish performing their tasks. Threads are composed of thread context which internally breaks down into set of machine registers, the kernel and user stack all linked to the address space of the process where the thread resides. A context switch occurs as the time splice elapses and an equal (or higher) priority thread becomes available and it is responsible for allowing high scalability if it is efficiently implemented. For example fibers which are executed entirely in userspace do not require a system call during a switch which highly increases efficiency. --Praubic 18:24, 13 October 2010 (UTC)

References