Question

How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?

Answer

A thread is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously. Threads require less system resources then concurrent cooperating processes and start much easier therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered more heavy and designs that involve them are not very scalable. User threads on the other hand, are mapped to kernel threads by the threads library such as libpthreads. There are a few designs that incorporate it, mainly Fibers and UMS (User Mode Scheduling) which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in the user mode makes them more efficient (depending on the application) than Thread Pools which are yet another mechanism that allows for high scalability.
Taken the liberty to add Praubic's tentative first para. No changes made as of yet.

Scalable Threads: The Problems

One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks. When porting Linux to a 64-core NUMA system Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them):

Cache-Coherency:There can be some instances of misplaced information in the cache that can cause a "cache-coherency operation" to be called. This operation is comparatively expensive. Once misplaced information that causes this problem is identified it can be moved to limit the problem.

Locks:There can also be some user-called locks that contribute to bottlenecks. One such lock is the xtime_lock in Linux. Having locking reading prevented writing to the timer value, leading to starvation. This problem was solved by using a lockless-read.

Scheduler:The multiqueue scheduler is the third major bottle neck. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems. The spinlock ate up a fair majority of the CPU time. Whilem, the rest went into computing and recomputing information in the cache. These problems were fixed by replacing the scheduler,. The scheduler was then replaced by a more efficient scheduler [O(1) scheduler].

CPU:The next few bottle necks are related. They're both examples of course-granularity locks eating CPU time. Granularity refers to the execution time of a code segment. The closer a segment is to the speed of an atomic action the finer its granularity. Another course-grained bottleneck was the dcache_lock. It ate up a adequate amount of time in normal use. But it was also called in the much more popular dnotify_parent() function, which made it unacceptable. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of linux.

Kernel Lock:One big course-grained bottleneck in the system is the "Big Kernel Lock" (BKL) linux's kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL's usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottle necks.

MAIN POINT 2 Paragraph draft --Praubic 00:21, 14 October 2010 (UTC) still in progress and debating

Introduction of Windows NT and OS/2 brought about innovation that provides cheap threading while having expensive processing. UMS which reflects such design is a recommended mechanism for high performance requirements which handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate performance systems because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviors such as race condition which requires careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, UMS scheduling can be assigned to each logical processor and thereby creating affinity for related threads to function around one scheduler. This could turn out inefficient depending whether there are many related threads that could end up starving other processes.

Ok for point 2 -> I posted a draft on the essay page but Im not certain as to whether i should talk about fibers since they are also functioning on user space but theyre not UMS. --Praubic

Design Choices

(A) Kernel Threads and User Threads (1:1 vs M:N)
--Gautam 00:29, 14 October 2010 (UTC)
This is the most basic design choice. The 1:1 boasts of a slim clean library interface on top of the kernel functions. Although, the M:N would implement a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to such a high cost for implementation. This gave birth to the 1:1 model.
(B)Signal Handling
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.
(C)Synchronization
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.
(D)Memory Management
Thread memory management is an important design choice when attempting to create a large amount of threads in a single process, from creation to maintenance and deallocation. A thread's data structure is made up of a program counter, a stack and a control block. A control block of a thread is needed for thread management as it contains the state data of a thread. The optimization of this data structure can greatly increase performance in large number of threads. The creation of a thread can take place before the process actually requires it to run and wait until a idle processor becomes available to run the thread. Thread overhead (the required memory, CPU time, and read/write time to initialize the thread) is a problem that can arise with this creation process, since it frontloads the process. Another problem with this creation process is that the thread must allocate the memory required for it's stack at creation because it is expensive to dynamically allocate the stack memory. A way to optimize this creation process for large amounts of threads is to copy the arguments of the thread into it's control block, this allows for the thread's stack to be allocated at the thread's startup (when the thread starts being used) and not when the thread is created. When the thread enters startup it can copy it's arguments out of it's control block and allocate it's memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate that the system can create, start, and finish threads that are in contention), and, if thread memory management is done in a serial processing manner, these two factor combine to create a maximum rate of thread creation. The deallocation of a thread can also be optimized for use in increasing the scalability of threads. Storing deallocted stacks and control blocks in a free list allows the process of allocation and deallocation to be a list operation, if they are not stored in a free list then the thread overhead would include finding the correct size of free memory to store the stack. [1] hirving
(E)Scheduling Priorities
A thread is an entity that can be scheduled according to its scheduling priority which is a number ranging from 0 to 31 for Windows and a Red-Black Tree used by the CFS (Completely Fair Scheduler) in Linux. All threads are executed in a time splice assigned to them in round robin fashion and lower priority threads wait until the ones above finish performing their tasks. Threads are composed of thread context which internally breaks down into set of machine registers, the kernel and user stack all linked to the address space of the process where the thread resides. A context switch occurs as the time splice elapses and an equal (or higher) priority thread becomes available and it is responsible for allowing high scalability if it is efficiently implemented. For example fibers which are executed entirely in userspace do not require a system call during a switch which highly increases efficiency. --Praubic 18:24, 13 October 2010 (UTC)

COMP 3000 Essay 1 2010 Question 7

Contents

Question

Answer

Scalable Threads: The Problems

Design Choices

References

Navigation menu

COMP 3000 Essay 1 2010 Question 7

Question

Answer

Scalable Threads: The Problems

Design Choices

References

Navigation menu

Search