Question

How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?

Answer

The Background

A 'process' is defined to be "an address-space and a group of resources dedicated to running the program". On the other hand a 'thread' is an independent sequential unit of computation that executes within the context of a kernel supported entity like a 'process'. Threads are often classified by their “weight” (or overhead), which corresponds to the amount of context that must be saved when a thread is removed from the processor, and restored when a thread is reinstated on a processor that is a context switch. The context for a process usually includes the hardware register, kernel stack, user-level stack, interrupt vectors, page tables, and more. Threads require less system resources then concurrent cooperating processes and start much easier, therefore, there may exist millions of them in a single process. Loosely based on this there are two major types of threads: kernel and user-mode. Kernel threads are usually considered heavier and designs that involve them are not very scalable. User threads, on the other hand, are mapped to kernel threads and lightwieght. The ratio of the user threads to kernel threads is an important factor when designing scalable systems.

There are a few designs, mainly Fibers and UMS (User Mode Scheduling) which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in the user mode makes them more efficient (depending on the application) than Thread Pools, which are yet another mechanism that allows for high scalability. Systems can support millions of threads within a single process by switching execution resources between threads, creating a concurrent execution. Concurrency is the result of multiple threads staying on the queues but is incapable of running them at the same time. It provides the impression that they are executing at the same time due to the speed they switch at.

Scalable Threads: The Problems

One of the basic challenges is to create code which is stable and at the same time scalable. Furthermore, the challenge in making an existing code base scalable is the identification and elimination of bottlenecks once scaled. Ray Bryant and John Hawkes found the following bottlenecks when porting Linux to a 64-core NUMA system. Each of these bottle necks are an example of a type of bottleneck that can appear in any program.[1]

When expensive operations are needlessly called one type of bottleneck appears. In Linux, there can be some instances of misplaced information in the cache that can cause a "cache-coherency operation" to be called. This operation is expensive when compared to what would happen if the information was in the 'right place'. Once the misplaced information that causes this problem all the time is identified it can be moved to limit the problem. Anywhere expensive operations are called a needless number of times, this bottleneck can appear (this problem is not inherent, but is a result of bad-design).

Another type of bottleneck is from starvation. An example of one such bottleneck is the xtime_lock in Linux. Having locking read prevented writing to the timer value, causing the kernel to waste CPU time to keep trying. This problem was solved by using a lockless-read. This problem would appear anywhere that a thread must keep trying to execute, but cannot, leading to wasted CPU cycles.

The next type of bottleneck is from course-grained operations. Granularity refers to the execution time of a code segment. Both examples eat alot of CPU time, where a finer-grained implementation would eat less. The closer a segment is to the speed of an atomic action the finer its granularity. One course-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also called in the much more popular dnotify_parent() function. This was deemed an unacceptable state of affairs. So, the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of linux. Another big course-grained bottleneck in the system is the "Big Kernel Lock" (BKL) linux's kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL's usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both those examples are the result of course granularity.

Bottlenecks can be from multiple problems. One example of that is the multiqueue scheduler from linux 2.4. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems: the spinlock ate up a fair majority of the CPU time, it was course-grained. While, the rest went into computing and recomputing information in the cache, a needless expensive operation. The Scheduler also had O(n) time complexity which essentially meant that the scheduler had scalability issues and would become inefficient after a particular number of processes. These problems were fixed by replacing the scheduler (That scheduler was then replaced by a more efficient scheduler with a O(1)time complexity which meant that any number of threads/processes could be scheduled without any overhead.

Introduction of Windows NT and OS/2 brought about innovation that provides cheap threading while having expensive processing. UMS which reflects such design is a recommended mechanism for high performance requirements that handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate performance systems because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviors such as race condition which requires careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, UMS scheduling can be assigned to each logical processor and thereby creating affinity for related threads to function around one scheduler[2]. This could turn out inefficient depending whether there are many related threads that could end up starving other processes.

Fibers embrace essentially the same abstraction as coroutines. The distinction emerges from the fact that fibers are on the system level while coroutines execute on the language level[3]. Unlike UMS, fibers do not utilize multiprocessor machines, however, they require less operating system support. Symbian Operating System presents an example of fibers usage in its Active Scheduler. An object of active scheduler contains a single fiber that is scheduled when an asynchronous call returns and blocks lower priority fibers until all above are finished.

Thread Pools consist of queues of threads that stay open and await new tasks to become assigned to them. If there is no new tasks to be completed, they sleep or wait. This pattern eliminates the overhead of creation and destruction of threads which reflects in better system stability and improved performance [4]. The long living threads can, for instance, handle multiple transaction requests from socket connection from other machines over a short time frame while at the same time avoiding the millions of cycles to drop/reestablish a thread. Often, thread pool operate on server farms and therefore thread-safety has to be carefully implemented.

Design Choices

(A) Kernel Threads and User Threads (1:1 vs M:N)
This is the most basic design and the lightweight process. The 1:1 boasts of a slim clean library interface on top of the kernel functions. Management and scheduling is done through thread management. Although, the M:N would implement a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to such a high cost for implementation. This gave birth to the 1:1 model. Thread aware operating system is found on Windows XP, Windows 2000, Windows Vista and any latest operating system.

(B)Signal Handling
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.

(C)Synchronization
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.

Explaining the four types of synchronization:

Mutex locks uses only a thread thus giving access to only certain part of the code
Using Read/Write synchronization one can gain exclusive write and read access to protected resource but to edit the content it must have the exclusive write lock. Exclusive write lock is only permitted when all the read locks are released
Condition variable synchronization protects the thread until the condition becomes true
Counting semaphores delivers access to multiple threads. It has a count which keeps tracks of the number of threads can have concurrent access to the data. Once the limit is reached other threads are blocked until the limit changes.

(D)Memory Management
Thread memory management is an important design choice when attempting to create a large amount of threads in a single process, from creation to maintenance and deallocation. A thread's data structure is made up of a program counter, a stack and a control block. A control block of a thread is needed for thread management as it contains the state data of a thread. The optimization of this data structure can greatly increase performance in large number of threads.

The creation of a thread can take place before the process actually requires it to run and wait until a idle processor becomes available to run the thread. Thread overhead (the required memory, CPU time, and read/write time to initialize the thread) is a problem that can arise with this creation process, since it frontloads the process. Another problem with this creation process is that the thread must allocate the memory required for it's stack at creation because it is expensive to dynamically allocate the stack memory. A way to optimize this creation process for large amounts of threads is to copy the arguments of the thread into it's control block, this allows for the thread's stack to be allocated at the thread's startup (when the thread starts being used) and not when the thread is created. When the thread enters startup it can copy it's arguments out of it's control block and allocate it's memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate that the system can create, start, and finish threads), and, if thread memory management is done in a serial processing manner, these two factor combine to create a maximum rate of thread creation.

Locks are an important part of the performance of threads and there are multiple way of controlling and creating locks in order to create a large amount of threads. Single lock (having the data structures all be in one lock) has the advantage that once the processor has acquired the lock it can modify any of the stored data. Using the single lock method means only one lock is needed per thread, decreasing the thread overhead but this also limits the throughput of the system. Multiple lock (having each data structure have it's own lock) has the advantage of that each action on the data structure is it's own locking/unlocking operations. Multiple has greater thread overhead (because there are more locks) but the thread throughput is much higher allowing for fast creation of threads. Another downside of multiple lock systems are deadlocks, a deadlock happens when two different threads are waiting for data that the other task holds. Single and multiple lock systems are the inverse of each other and using both depending on the situation can greatly increase the performance of a system.

The deallocation of a thread can also be optimized for use in increasing the scalability of threads. Storing deallocted stacks and control blocks in a free list allows the process of allocation and deallocation to be a list operation, if they are not stored in a free list then the thread overhead would include finding the correct size of free memory to store the stack. [5]

(E)Scheduling Priorities
A thread is an entity that can be scheduled according to its scheduling priority which is a number ranging from 0 to 31 for Windows and a Red-Black Tree used by the CFS (Completely Fair Scheduler) in Linux. All threads are executed in a time splice assigned to them in round robin fashion and lower priority threads wait until the ones above finish performing their tasks. Threads are composed of thread context which internally breaks down into set of machine registers, the kernel and user stack all linked to the address space of the process where the thread resides. A context switch occurs as the time splice elapses and an equal (or higher) priority thread becomes available and it is responsible for allowing high scalability if it is efficiently implemented. For example fibers which are executed entirely in userspace do not require a system call during a switch which highly increases efficiency.[6][7], Microsoft (23 September 2010)

References

Native POSIX Threads http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.6590&rep=rep1&type=pdf
Linux Symposium, pg83 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1.7621&rep=rep1&type=pdf#page=83
PicoThreads: Lightweight Threads in Java http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.9043&rep=rep1&type=pdf
UMS Threading http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx
Fibers http://msdn.microsoft.com/en-us/library/ms682661.aspx
Thread pools queues http://www.ibm.com/developerworks/java/library/j-jtp0730.html
On the design of Chant: a talking threads package http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=344298