COMP 3000 Essay 1 2010 Question 7: Difference between revisions
Line 61: | Line 61: | ||
== References == | == References == | ||
<references/> |
Revision as of 16:14, 14 October 2010
Question
How is it possible for systems to supports millions of threads or more within a single process? What are the key design choices that make such systems work - and how do those choices affect the utility of such massively scalable thread implementations?
Answer
Process is known as an instance of a program running in a computer which has its own resources such as address space, files, I/O devices and thread on the other hand is an independent task that executes in the same address space as other threads within a single process while sharing data synchronously and it can either execute the same code or a different code within the same application because it has its own state, run-time stack and execution context. Threads require less system resources then concurrent cooperating processes and start much easier therefore there may exist millions of them in a single process. The two major types of threads are kernel and user-mode. Kernel threads are usually considered more heavy and designs that involve them are not very scalable. User threads on the other hand, are mapped to kernel threads by the threads library such as libpthreads. There are a few designs that incorporate it, mainly Fibers and UMS (User Mode Scheduling) which allow for very high scalability. UMS threads have their own context and resources. However, the ability to switch in the user mode makes them more efficient (depending on the application) than Thread Pools which are yet another mechanism that allows for high scalability. Systems can support millions within a single process by switching execution resources between threads, creating a concurrent execution. Concurrency is the result of multiple threads staying on the queues but is incapable of running them at the same time. It provides the impression that they are executing at the same time due to the speed they switch at.
vG && Paul
Taken the liberty to add Praubic's tentative first para. and i have added my version to pauls and modified it vG
Scalable Threads: The Problems
One of the challenges in making an existing code base scalable is the identification and elimination of bottlenecks once scaled. When porting Linux to a 64-core NUMA system Ray Bryant and John Hawkes found the following bottlenecks (or just wrote a paper about them). Each of these bottle necks is an example of a type of bottleneck that can appear in any program.
When expensive operations are needlessly called one type of bottleneck appears. In Linux there can be some instances of misplaced information in the cache that can cause a "cache-coherency operation" to be called. This operation is expensive compared to what would happen if the information was in the 'right place'. Once misplaced information that causes this problem all the time is identified it can be moved to limit the problem. Anywhere expensive operations are called a needless number of times this bottleneck can appear (this problem is not inherent, but is a result of bad-design).
Another type of bottleneck is from starvation. One such bottleneck is the xtime_lock in Linux. Having locking reading prevented writing to the timer value, causing the kernel to waste CPU time to keep trying. This problem was solved by using a lockless-read. This problem would appear anywhere that a thread must keep trying to execute, but cannot, leading to wasted CPU cycles.
The next type of bottleneck is from course-grained operations. Granularity refers to the execution time of a code segment. Both examples eat alot of CPU time, where a finer-grained implementation would eat less. The closer a segment is to the speed of an atomic action the finer its granularity. One course-grained bottleneck was the dcache_lock. It ate up some time in normal use, but it was also called in the much more popular dnotify_parent() function. That was an unacceptable state of affairs. So the dcache_lock strategy was replaced with a finer-grained strategy from a later implementation of linux. Another big course-grained bottleneck in the system is the "Big Kernel Lock" (BKL) linux's kernel synchronization control. Waiting for the BKL took up as much as 70% of the CPU time on a system with only 28 cores. The preferred method, on Linux NUMA systems, was to limit the BKL's usage. The ext2 and ext3 file systems were replaced with a file system that uses finer-grained locking (XFS), reducing the impact of the bottleneck. Both those examples are the result of course granularity.
Bottlenecks can be from multiple problems. One example of that is the multiqueue scheduler from linux 2.4. Altogether, the multiqueue scheduler ate up 25% of the CPU time. It had two problems. The spinlock ate up a fair majority of the CPU time, it was course-grained. While, the rest went into computing and recomputing information in the cache, a needless expensive operation. These problems were fixed by replacing the scheduler (That scheduler was then replaced by a more efficient scheduler [O(1) scheduler]).
--Rannath
MAIN POINT 2 Paragraph draft --Praubic 00:21, 14 October 2010 (UTC) still in progress and debating
Introduction of Windows NT and OS/2 brought about innovation that provides cheap threading while having expensive processing. UMS which reflects such design is a recommended mechanism for high performance requirements which handle many threads on multicore systems. A scheduler has to be implemented to manage the UMS threads and decide when they should be run or stopped. This implementation is not desirable for moderate performance systems because concurrent execution of this sort naturally allows for non-intuitive outcomes or behaviors such as race condition which requires careful programming and design choices. The framework used by UMS threading is divided into smaller abstractions depending on the final desired utility. For instance, UMS scheduling can be assigned to each logical processor and thereby creating affinity for related threads to function around one scheduler. This could turn out inefficient depending whether there are many related threads that could end up starving other processes.
Fibers embrace essentially the same abstraction as coroutines. The distinction emerges from the fact that fibers are on the system level while coroutines execute on the language level. Unlike UMS, fibers do not utilize multiprocessor machines however they require less operating system support. Symbian Operating System presents an example of fibers usage in its Active Scheduler. An object of active scheduler contains a single fiber that is scheduled when an asynchronous call returns and blocks lower priority fibers until all above are finished.
Thread Pools consist of queues of threads that stay open and await new tasks to become assigned to them. If there is no new tasks to be completed they sleep or wait.This pattern eliminates the overhead of creation and destruction of threads which reflects in better system stability and improved performance. The long living threads can for instance handle multiple transaction requests from socket connection from other machines over a short time frame while at the same time avoiding the millions of cycles to drop/reestablish a thread. Often, thread pool operate on server farms and therefore thread-safety has to be carefully implemented.
Design Choices
(A) Kernel Threads and User Threads (1:1 vs M:N)
--Gautam 00:29, 14 October 2010 (UTC)
This is the most basic design choice. The 1:1 boasts of a slim clean library interface on top of the kernel functions. Although, the M:N would implement a complicated library, it would offer advantages in areas of signal handling. A general consensus was that the M:N design was not compatible with the Linux kernel due to such a high cost for implementation. This gave birth to the 1:1 model.
(B)Signal Handling
The kernel implements the POSIX signal handling for use with the multitude of signal masks. Since the signal will only be sent to a thread if it is unblocked, no unnecessary interruptions through signals occur. The kernel is also in a much better situation to judge which is the best thread to receive the signal. This only holds true if the 1-on-1 model is used.
(C)Synchronization
The implementation of the synchronization primitives such as mutexes, read-write locks, conditional variables, semaphores, and barriers requires some form of kernel support. Busy waiting is not an option since threads can have different priorities (beside wasting CPU cycles). The same argument rules out the exclusive use of sched yield. Signals were the only viable solution for the old implementation. Threads would block in the kernel until woken by a signal. This method has severe drawbacks in terms of speed and reliability caused by spurious wakeups and derogation of the quality of the signal handling in the application. Fortunately, new functionality was added to the kernel to implement all kinds of synchronization.
Explaining the four types of synchronization:
- Mutex locks uses only a thread thus giving access to only certain part of the code
- Using Read/Write synchronization one can gain exclusive write and read access to protected resource but to edit the content it must have the exclusive write lock. Exclusive write lock is only permitted when all the read locks are released
- Condition variable synchronization protects the thread until the condition becomes true
- Counting semaphores delivers access to multiple threads. It has a count which keeps tracks of the number of threads can have concurrent access to the data. Once the limit is reached other threads are blocked until the limit changes.
vG
(D)Memory Management
Thread memory management is an important design choice when attempting to create a large amount of threads in a single process, from creation to maintenance and deallocation. A thread's data structure is made up of a program counter, a stack and a control block. A control block of a thread is needed for thread management as it contains the state data of a thread. The optimization of this data structure can greatly increase performance in large number of threads.
The creation of a thread can take place before the process actually requires it to run and wait until a idle processor becomes available to run the thread. Thread overhead (the required memory, CPU time, and read/write time to initialize the thread) is a problem that can arise with this creation process, since it frontloads the process. Another problem with this creation process is that the thread must allocate the memory required for it's stack at creation because it is expensive to dynamically allocate the stack memory. A way to optimize this creation process for large amounts of threads is to copy the arguments of the thread into it's control block, this allows for the thread's stack to be allocated at the thread's startup (when the thread starts being used) and not when the thread is created. When the thread enters startup it can copy it's arguments out of it's control block and allocate it's memory. Thread creation is ruled by latency (the cost of thread management on the system) and throughput (the rate that the system can create, start, and finish threads that are in contention), and, if thread memory management is done in a serial processing manner, these two factor combine to create a maximum rate of thread creation.
The deallocation of a thread can also be optimized for use in increasing the scalability of threads. Storing deallocted stacks and control blocks in a free list allows the process of allocation and deallocation to be a list operation, if they are not stored in a free list then the thread overhead would include finding the correct size of free memory to store the stack. [1] hirving
(E)Scheduling Priorities
A thread is an entity that can be scheduled according to its scheduling priority which is a number ranging from 0 to 31 for Windows and a Red-Black Tree used by the CFS (Completely Fair Scheduler) in Linux. All threads are executed in a time splice assigned to them in round robin fashion and lower priority threads wait until the ones above finish performing their tasks. Threads are composed of thread context which internally breaks down into set of machine registers, the kernel and user stack all linked to the address space of the process where the thread resides. A context switch occurs as the time splice elapses and an equal (or higher) priority thread becomes available and it is responsible for allowing high scalability if it is efficiently implemented. For example fibers which are executed entirely in userspace do not require a system call during a switch which highly increases efficiency.<ref>Scheduling Priorities</ref>, Microsoft (23 September 2010) --Praubic 18:24, 13 October 2010 (UTC)
References
<references/>