Question

Compare and contrast the evolution of the default BSD/FreeBSD and Linux schedulers.

Answer

One of the most difficult problems that operating systems must handle is process management. In order to ensure that a system will run efficiently, processes must be maintained, prioritized, categorized and communicated with, all without experiencing critical errors such as race conditions or process starvation. A critical component in the management of such issues is the operating system’s scheduler. The goal of a scheduler is to ensure that all processes of a computer system get access to the system resources they require as efficiently as possible while maintaining fairness for each process, limiting CPU wait times, and maximizing the throughput of the system (Jensen: 1985).

There are several different algorithms which are utilized in different schedulers, but a few key algorithms are outlined below[1][2][3]:

First-Come, First-Serve (also known as FIFO): No multi-tasking. Processes are queued in the order they are called. A process gets full, uninterrupted use of the CPU until it has finished running.
Shortest Job First (similar to Shortest Remaining Time and/or Shortest Process Next): Limited multi-tasking. The CPU handles the easiest tasks first, and complex, time-consuming tasks are handled last.
Fixed-Priority Preemptive Scheduling: Greater multi-tasking. Processes are assigned priority levels which are independent of their complexity. High-priority processes can be completed quickly, while low-priority processes can take a long time as new, higher-priority processes arrive and interrupt them.
Round-Robin Scheduling: Fair multi-tasking. This method is similar in concept to First-Come, First-Serve, but all processes are assigned the same priority level; that is, every running process is given an equal share of CPU time.
Multilevel Feedback Queue Scheduling: Rule-based multi-tasking. It is a combination of First-Come, First-Serve, Round-Robin & Fixed-Priority Preemptive Scheduling, but processes are associated with groups that help determine how high their priorities are. For example, all I/O tasks get low priority since much time is spent waiting for the user to interact with the system.

There is no one "best" algorithm and most schedulers utilize a combination of the different algorithms, such as the Multi-Level Feedback Queue, which in one way or another was utilized in Win XP/Vista, Linux 2.5-2.6, FreeBSD, Mac OSX, NetBSD and Solaris.
One thing for certain is that as computer hardware increases in complexity, such as multiple core CPUs (parallelization), and with the advent of more powerful embedded/mobile devices, schedulers of operating systems have similarly evolved to handle these additional challenges. In this article we will compare and contrast the evolution of two such schedulers:

The default BSD/FreeBSD scheduler
The Linux scheduler

BSD/Free BSD Schedulers

Overview & History

The FreeBSD kernel originally inherited its scheduler from 4.3BSD which itself is a version of the UNIX scheduler [4]. In order to understand the evolution of the FreeBSD scheduler it is important to understand the original purpose and limitations of the BSD scheduler. Like most traditional UNIX based systems, the BSD scheduler was designed to work on a single core computer system (with limited I/O) and handle relatively small numbers of processes. As a result, managing resources with an O(n) scheduler did not raise any performance issues. To ensure fairness, the scheduler would switch between processes every 0.1 second (100 milliseconds) in a round-robin format [5].

As computer systems increased in complexity with the advent of multi-core CPUs and various new I/O devices, computer programs, naturally, increased in size and complexity to accommodate and manage the new hardware. With CPUs becoming more powerful (derived from Moore's Law [6]), the time taken to complete a process decreased significantly. This additional complexity highlighted the problem of having an O(n) scheduler for managing processes, as more items were added to the scheduling algorithm, the performance decreased. With symmetric multiprocessing (SMP) becoming inevitable (multi-core CPU's) a better scheduler was required. This was the driving force behind the creation of ULE for the FreeBSD.

Older Versions

The FreeBSD kernel originally used an enhanced version of the BSD scheduler. Specifically, the FreeBSD scheduler included classes of threads, which was a drastic change from the round-robin scheduling used in BSD. Initially, there were two types of thread class, real-time and idle [7], and the scheduler would give processor time to real-time threads first and the idle threads had to wait until there were no real-time threads that needed access to the processor.

To manage the various threads, FreeBSD had data structures called runqueues, into which the threads were placed. The scheduler would evaluate the runqueues based on priority from highest to lowest and execute the first thread of a non-empty runqueue it found. Once a non-empty runqueue was found, each thread in the runqueue would be assigned an equal value time slice of 0.1 seconds, a value that has not changed in over 20 years [8].

Unfortunately, like the BSD scheduler it was based on, the original FreeBSD scheduler was not built to handle Symmetric Multiprocessing (SMP) or Symmetric Multithreading (SMT) on multi-core systems. The scheduler was still limited by an O(n) algorithm, which could not efficiently handle the loads required on ever increasingly powerful systems. To allow FreeBSD to operate with more modern computer systems, it became clear that a new scheduler would be required, and thus, the ULE scheduler was created.

Current Version

ULE was first implemented as part of an "experimental" process (by Jeff Roberson) in FreeBSD v5.1, before being added to the FreeBSD v5.3 development cycle. It was designed with modern hardware and requirements in mind and had proper support for Symmetric Multi-Processing (SMP) (and HTT), Symmetric Multi-Threading (SMT) platforms and handle heavy workloads. Primarily being an event-driven scheduler, ULE utilized a double-queue mechanism (borrowed from Linux's O(1) scheduler) for ensuring fairness. This mechanism is briefly outlined as follows[9]:

Process threads are assigned in 2 queues, 'current' and 'next'
Each thread is either assigned to 'current' or 'next'
Process execution first begins in the 'current' queue (priority based)
Once 'current' is empty, the 'next' and 'current' queues are switched and the threads are executed in a similar manner (priority based)
All idle threads are stored in a third queue, 'idle' and is run only when 'current' and 'next' are empty

It has been implemented as the default scheduler since v7.1 onwards. ULE works really well on both single or uni-processor environments as well as multi-core environments. It prevents unnecessary CPU migration, while making good use of CPU resources. However, 2 key practical problems arose due to the double-queue mechanism [10]:

Implementing queues meant more memory usage and cache hits
Having multiple threads on different queues caused problems for the rest of the system

This problem was resolved by implementing a circular queue, inspired by the calendar queue data structure. Since there is only one queue, there is little extra cpu overhead. There is more flexibility in deciding how much runtime each thread gets relative to those with a lower priority. With these design approaches in mind, ULE is also slightly faster on uni-processor systems than the 4.4BSD.

Linux Schedulers

Overview & History

Unlike its competitor, FreeBSD, the Linux scheduler has undergone several different cycles of evolution [11], always aiming towards having a fair and fast scheduler. Various methods and concepts have been tried over different versions such as round robin, iterations, and queues. A quick read through of the history of Linux implies that firstly, equal and balanced use of the system was the goal of the scheduler (to ensure fairness). Eventually, the speed issue was also tackled to ensure that the run times were not affected. Early schedulers did their best to give processes equal time and resources, but used a bit of extra time (in computer terms) to accomplish this. With the advent of more complicated hardware (multi-core CPU's), Linux also faced new challenges, as experienced by BSD. Various techniques were employed to accommodate for these changes ranging from the inefficient O(n) scheduler to the CFS (current version) [12].

Older Versions

In Linux 1.2 a scheduler operated with a round robin policy using a circular queue, allowing the scheduler to be efficient in adding and removing processes. When Linux 2.2 was introduced, the scheduler was changed and used the idea of scheduling classes, thus allowing it to schedule real-time tasks, non real-time tasks, and non-preemptible tasks. It was the first scheduler that supported SMP.

With the introduction of Linux 2.4, the scheduler was changed again. The scheduler started to be more complex than its predecessors, but it also provided more features. The running time was O(n) because it iterated over each task during a scheduling event. The scheduler divided tasks into epochs, allowing each task to execute up to its time slice. If a task did not use up its entire time slice, the remaining time was added to the next time slice to allow the task to execute longer in its next epoch. The scheduler simply iterated over all tasks, which made it inefficient, low in scalability, and did not have a useful support for real-time systems. On top of that, it did not have features to exploit new hardware architectures, such as multi-core processors.

Current Version

As of the Linux 2.6.23 introduction, the CFS (Completely Fair Scheduler) took its place in the kernel. CFS uses the idea of maintaining fairness in providing processor time to tasks, which means each tasks gets a fair amount of time to run on the processor. When the time task is out of balance, it means the tasks has to be given more time because the scheduler has to keep fairness. To determine the balance, the CFS maintains the amount of time given to a task, which is called a virtual runtime.

The model of how the CFS executes has changed, too. The scheduler now runs a time-ordered red-black tree. It is self-balancing and runs in O(log n) where n is the amount of nodes in the tree, allowing the scheduler to add and erase tasks efficiently. Tasks with the most need of processor are stored in the left side of the tree. Therefore, tasks with a lower need of cpu are stored in the right side of the tree. To keep fairness, the scheduler takes the left-most node from the tree. The scheduler then accounts execution time at the CPU and adds it to the virtual runtime. If runnable, the task then is inserted into the red-black tree. This means tasks on the left side are given time to execute, while the contents on the right side of the tree are migrated to the left side to maintain fairness.

Under a recent Linux system (version 2.6.35 or later), scheduling can be handled manually by the user by assigning programs different priority levels, called "nice levels." Put simply, the higher a program's nice level is, the nicer it will be about sharing system resources. A program with a lower nice level will be more greedy, and a program with a higher nice level will more readily give up its CPU time to other, more important programs. This spectrum is not linear; programs with high negative nice levels run significantly faster than those with high positive nice levels. The Linux scheduler accomplishes this by sharing CPU usage in terms of time slices (also called quanta), which refer to the length of time a program can use the CPU before being forced to give it up. High-priority programs get much larger time slices, allowing them to use the CPU more often and for longer periods of time than programs with lower priority. Users can adjust the niceness of a program using the shell command nice( ). Nice values can range from -20 to +19.

In previous versions of Linux, the scheduler was dependent on the clock speed of the processor. While this dependency was an effective way of dividing up time slices, it made it impossible for the Linux developers to fine-tune their scheduler to perfection. In recent releases, specific nice levels are assigned fixed-size time slices instead. This keeps nice programs from trying to muscle in on the CPU time of less nice programs, and also stops the less nice programs from stealing more time than they deserve.

In addition to this fixed style of time slice allocation, Linux schedulers also have a more dynamic feature, which causes them to monitor all active programs. If a program has been waiting an abnormally long time to use the processor, it will be given a temporary increase in priority to compensate. Similarly, if a program has been hogging CPU time, it will temporarily be given a lower priority rating.

Conclusion

A few key differences between the Linux & FreeBSD schedulers are outlined as below:

ULE runs in O(1) time whereas CFS runs in O(log(n)) time
ULE is based on run-queues whereas CFS utilizes a red-black tree
Context switching is done much faster by ULE than Linux [13]

Hardware innovation has primarily been the driving force behind the evolution of the schedulers and with the ever increasing complexity of parallel & distributed systems and managing multi-threaded it is clear that even the current schedulers will undergo some radical changes to meet those new challenges. It remains to be seen what each group comes up with to address these growing challenges

Resources

1. Jensen, Douglas E., C. Douglass Locke and Hideyuki Tokuda, A Time-Driven Scheduling Model for Real-Time Operating Systems, Carnegie-Mellon University, 1985.

2. Stallings, William, Operating Systems: Internals and Design Principles, Pearson Prentice Hall, 2009.

3. McKusick, M. K. and Neville-Neil, G. V. 2004. Thread Scheduling in FreeBSD 5.2. Queue 2, 7 (Oct. 2004), 58-64. DOI= http://doi.acm.org/10.1145/1035594.1035622