Revision as of 04:59, 2 December 2010

Virtualize Everything But Time

Article written by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch. They are working for the Center for Ultra-Broadband Information Networks (CUBIN) Department of Electrical & Electronic Engineering at the University of Melbourne in Australia. Here is the link to the article: [1]

Background Concepts

The next time you notice one stranger ask another for the time and you see them check their watch, try this experiment: immediately ask too. Chances are the person will check their watch again. Why? Human internal clocks are notoriously unreliable. Our sense of time contracts and expands all day long. We seem to believe that a definitive report of time can only come from some mechanical or electronic source. So social norms require that the watch owner provides you with two things: 1) the time, and 2) a gesture of external authority, i.e. a glance at their watch.

The story of time inside a virtual machine is almost as unreliable as our own internal clocks. How much time has elapsed since a VM client got the CPU's attention? At the best of times there's no way for it to guess because it wasn't actually running. If the VM was suspended and migrated from one physical host to another its concept of time is even worse. This paper is about how a computer glances at its metaphorical watch, and what kinds of timepieces it has at hand.

To better understand this paper, it is very important to have a good understanding of the general concepts breached in it. For example, we all know what clocks are in our day-to-day life but what are they in the context of computing? In this section, we will describe concepts like timekeeping, hardware/software clocks, the advantages and disadvantages of the different available counters, synchronization algorithms and explains what is a para-virtualized system.

Timekeeping

For thousands of years, men have tried to find better ways to keep track of time. From sundials to atomic clocks, they were all made for the specific purpose of measuring the passage of time. This is not so different in computer operating systems. It is typically done in one of two ways: tick counting and tickless timekeeping[2]. Tick counting is when the operating system sets up an hardware device, generally a CPU, to interrupt at a certain rate. So each time one of those interrupts are called(a tick), the operating system will keep track of it in a counter. That will tell the system how much time has passed. In tickless timekeeping, instead of the OS keeping track of time through interrupts, a hardware device is used instead starting its own counter when the system is booted. The OS just need to read the counter from it when needed. Tickless timekeeping seems to be the better way to keep track of time because it doesn’t hog the CPU with hardware interrupts however, its performance is very dependent on the type of hardware used. Another disadvantages is that they tend to drift and can cause inaccuracy. I will explains those drifts later. But both of these are just counters. They don’t know what is the actual real-time. To remedy that, either a computer gets its time from a battery-backed real-time clock or it queries a network time server(NTP) to get the current time. The computer can also use software in the form of a daemon that will run periodically to make adjustments to the time.

Clocks

Computer “clocks” or “timers” can be hardware based, software based or they can even be an hybrid. The most commonly found timer is the hardware timer. All of the hardware timers can be generally described by this diagram where some have either more or less features:

Diagram1. Timer Abstraction

This diagram nicely represents how tick counting works. The oscillator runs at a predetermined frequency. The operating system might have to measure it when the system boots. The counter starts with a predetermined value which can be set by software. For every cycle of the oscillator, the counter counts down one unit. When it reaches zero, its generates an output signal that might interrupt the CPU. That same interrupt will then allow the counter’s initial value to be reloaded into the counter and the process begins again. Not all hardware timers work exactly like that. For instance, some actually count up, others don't use interrupts, and yet others don't keep an initial counter. The general principle of hardware counters is the however the same. There is some kind of fixed interval at the end of which the current time is updated by an appropriate number of units (i.e. nanoseconds).

Timers

PIT is useful for generating interrupts at regular intervals through its three channels. Channel 0 is bound to IRQ0 which interrupts the CPU at regular intervals. Channel 1 is specific to each system and Channel 2 is connected to the speaker system. As such, we only need to concern ourselves with Channel 0. [3]
CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [4]
Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [5]
ACPI establishes industry-standard interfaces configuration guided by the OS and power management. It is industry-standard through its creators, Intel, Microsoft, Phoenix, Hewlett Packard and Toshiba. Its power management includes all forms: notebooks, desktops, and servers. ACPI's goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS as well as the VM to have control over power management. [6][7][8]
RDTSC is based on the x86 P5 instruction set and perform high-resolution timing, however, it suffers from several flaws. Discontinuous values from the processor are caused as a result of not using the same thread to the processor each time, which can also be caused by having a multicore processor. This is made worse by ACPI which will eventually lead to the cores being completely out of sync. Availability of dedicated hardware: "RDTSC locks the timing information that the application requests to the processor's cycle counter." With dedicated timing devices included on modern motherboards this method of locking the timing information will become obsolete. Lastly, the variability of the CPU's frequency needs to be taken into account. With modern day laptops, most CPU frequencies are adjusted on the fly to respond to the users demand when needed and to lower themselves when idle, this results in longer battery life and less heat generated by the laptop but regretfully affects RDTSC making it unreliable. [9]
HPET defines a set of timers that the OS has access to and can assign to applications. Each timer can generate an interrupt when the least significant bits are equal to the equivalent bits of the 64-bit counter value. However, a race case can occur in which the target time has already passed. This causes more interrupts and more work even if the task is a simple one. It does produce less interrupts than its predecessors PIT and CMOS RTC giving it an edge. Despite its race condition, this modern timer is improvement upon old practices. [10]

Guest Timekeeping

Guest timekeeping is done using the same general methods as any computer timekeeping, using either tick counting or tickless systems. Where the two begin to differ, however, is that a host operating system is able to communicate directly with the physical hardware, while the guest operating system is unable to do so, having to communicate with the host system that it wants to communicate with the hardware. Having to do this is the greatest source of the guest operating system's clock losing accuracy, or more simply called drifting.

Sources of Drift

When a guest operating system is started, its clock simply synchronizes with the host's – some virtual machines such as VMware also do this when it is resumed from a suspended state, or restored from a snapshot – so it is easy to think that, since it starts off correctly the guest's clock will continue to be correct. That is, of course, incorrect. The first source of drift is simply due to the drift a host incurs in its own timekeeping. A clock is almost never entirely accurate, having a slight error due to the time used to communicate with the counter, even on the host system, and because the guest communicates with the host in order to keep track of its time, an error in the host's time is not only passed on to the guest, but because the host is trying to correct its own time the guest's request for a count is given slightly less priority, making it yet again lose accuracy. The larger the drift in the host, the larger the drift in the guest, as the host's drift simply compounds the issue.

Aside from the host's own drift, the other cause of drift in the virtual environment is the fact that the it is treated like a process by the host. In and of itself this doesn't seem like a problem, but because of it it can be denied the CPU time required, or allocated less memory than needed. With restricted CPU time, it's easy for the requested ticks or requested read of a counter to pile up and create a backlog of requests, or simply receive the requested data late enough to throw its clock off. With memory, if the virtual environment does not have enough allocated to it by the host it can run into the problem of swapping out pages that are needed soon. Swapping the pages back in will momentarily bring the entire virtual environment to a halt, so ticks are missed and the clock falls behind.

Impact of Drift

The impact of drift essentially boils down to round-off errors and lost ticks. The practical impact of drift, however, is quite apparent in any automated system. For a relatable real-world example, though not in a virtual environment, in a factory's assembly line, the machinery is finely tuned to do its own specific part at certain intervals, and it generally does so with impressive efficiency. If the clock in the system were to drift, however, a specific machine may move too soon or too late, bringing the line to a potentially catastrophic halt. In a virtual environment, drift is a bit more subtle, as one result of it could be skewed process scheduling – some schedulers give a certain amount of time to a process before moving on, but if the guest's time has drifted substantially, when it tries to correct its time it could give more or less time to the processes in the scheduler.

Compensation Strategies

There are a number of compensation strategies for dealing with drift, depending on the cause of it. If the problem is due to CPU management issues, then the host can give more CPU time to the virtual machine, or it can lower the timer interrupt rate – or simply use a tickless counter. If it is due to a memory management issue, allocating more memory to the virtual environment should prevent the system from needing to swap out page files so often.

If the issue is from neither of those, but simply due to the inevitable lag when the guest communicates with the hardware via the host, then there are other methods to correct the drift. Most systems natively have algorithms built in to correct the time if it gets too far ahead or behind real time, though they are not without their own faults; if the time is set ahead when catching up, the backlog of ticks it has built up may not be cleared, so it could potentially set itself ahead multiple times until the backlog is dealt with. Tools built into the virtual machine itself can also deal with drift to an extent, as VMware Tools does. This kind of tool checks to see if the clock's error is within a certain margin. If it exceeds the margin, then the backlog is set to zero – to prevent the issue mentioned with the native algorithms – and resynchronizes with the host clock before the guest goes back to keeping track of time as it normally would.

Research problem

Today, the use of the Network Time Protocol and of daemons like ntpd is the dominant solution for accurate timekeeping. In optimal conditions, the ntpd can be very good but these situations rarely happen. Network congestion, disconnections, lower quality networking hardware and unsuspected system events can create offsets errors in the order of 10‘s or even 100 milliseconds(ms). [11] For demanding applications, this is neither robust or reliable. One way to enhance the performance of ntpd would be to poll from the NTP server more often as this would reduced the offset error but unfortunately, this would increase the network traffic which could cause network congestions which would raise the offset error. So this won’t work.

Another problem with current system software clocks using NTP(like ntpd), is that they provide only an absolute clock.[12] So for applications that deals with network managements and measurements, this is unsuitable. Why? Because NTP focus on offset and not on hardware clock oscillator rate. For example, when calculating delay variations, the offset error doesn’t change anything to the calculations but the clocks’ oscillator rate variation does affect it. So having a more accurate timestamp would make those calculation more precise. Which mean we would need another system software clock.

In virtualization(in this case Xen), when migrating a running system from one system to another can cause issues and this is again caused by the ntpd daemon. By default, each guest OS runs its own instance of the ntpd daemon. So the synchronization algorithm keeps track of the reference wallclock time, rate-of-drift and current clock error, which are defined by the hardware clock on the system. So when migrating the virtualized OS to another system, the ntpd state is saved and when it is enabled again on the new system, thats where the problems starts. Because no two hardware clocks drifts the same way or have the exact same wallclock time, all the information traced by the daemon are all of a sudden inaccurate. This could prove disastrous to the system. This could go from a slowly recoverable error to one where ntpd might never recover, making the virtualized OS unstable.

Contribution

(sections are stubs for the moment ... more to come) The contributions of this paper were:

baseline evaluations of:
- performance of NTP in dependent and independent configurations
- Xen Clocksource as a basis counter under NTP
- latencies of different clock sources
- implications of Power Management

new architecture
- RADclock
- XenStore as holder of clock parameter data
- feed-forward versus feedback

evaluation of RADclock vs ntpd

Critique

References

1. "Virtualize Everything But Time" by Timothy Broomhead, Laurence Cremean, Julien Ridoux and Darrel Veitch, 2010. http://www.usenix.org/events/osdi10/tech/full_papers/Broomhead.pdf

2. "Timekeeping in Virtual Machines, Information Guide" from VMWare. http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf

3. "Bran's Kernel Development Tutorial" from Bona Fide OS Developer website. http://www.osdever.net/bkerndev/Docs/pit.htm

4. "What is a CMOS battery, and why does my computer need one?" from the Indiana University's Knowledge Base, 2010. http://kb.iu.edu/data/adoy.html

5. "Multiprocessor Specification version 1.4" from Intel, 1997. http://developer.intel.com/design/pentium/datashts/24201606.pdf

6. "PC Based Precision Timing Without GPS" by Attila Pa ́sztor and Darryl Veitch, 2002. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/tscclock_final.pdf

7. "Robust Synchronization of Absolute and Difference Clocks over Networks" by Darryl Veitch, Julien Ridoux and Satish Babu Korada, 2009. http://www.cubinlab.ee.unimelb.edu.au/~darryl/Publications/synch_ToN.pdf

8. Broomhead, T.; Ridoux, J.; Veitch, D.; , "Counter availability and characteristics for feed-forward based synchronization," Precision Clock Synchronization for Measurement, Control and Communication, 2009. ISPCS 2009. International Symposium on , vol., no., pp.1-6, 12-16 Oct. 2009

@@ Line 29: / Line 29: @@
 # CMOS RTC, also known as a CMOS battery, allows the CMOS chip to remain powered to keep track of things like time even while the physical PC unit has no source of power. If there is no CMOS battery on the motherboard, the computer would reset to its default time each restart. The battery itself can die, as expected, if the computer is powered off and not used for a long period of time. This can cause issues with the main OS as well as the VM. [http://kb.iu.edu/data/adoy.html]
 # Local APIC handles all external interrupts for the processor in the system. It can also accept and generate inter-processor interrupts between Local APICs. [http://developer.intel.com/design/pentium/datashts/24201606.pdf]
-# ACPI establishes industry-standard interfaces configuration guided by the OS and power management. Power Management includes notebooks, desktops, and servers. ACPI's goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS as well as the VM to have control over power management. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]
+# ACPI establishes industry-standard interfaces configuration guided by the OS and power management. It is industry-standard through its creators, Intel, Microsoft, Phoenix, Hewlett Packard and Toshiba. Its power management includes all forms: notebooks, desktops, and servers. ACPI's goal is to improve current power and configuration standards for hardware devices by transitioning to ACPI-compliant hardware. This allows the OS as well as the VM to have control over power management. [http://www.intel.com/technology/iapc/acpi/][http://www.acpi.info/][http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf]
 # RDTSC is based on the x86 P5 instruction set and perform high-resolution timing, however, it suffers from several flaws. Discontinuous values from the processor are caused as a result of not using the same thread to the processor each time, which can also be caused by having a multicore processor. This is made worse by ACPI which will eventually lead to the cores being completely out of sync. Availability of dedicated hardware: "RDTSC locks the timing information that the application requests to the processor's cycle counter." With dedicated timing devices included on modern motherboards this method of locking the timing information will become obsolete. Lastly, the variability of the CPU's frequency needs to be taken into account. With modern day laptops, most CPU frequencies are adjusted on the fly to respond to the users demand when needed and to lower themselves when idle, this results in longer battery life and less heat generated by the laptop but regretfully affects RDTSC making it unreliable. [http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx]
-# HPET (I will fill these in later, Dec. 1st Update) --[[User:Spanke|Spanke]]
+# HPET defines a set of timers that the OS has access to and can assign to applications. Each timer can generate an interrupt when the least significant bits are equal to the equivalent bits of the 64-bit counter value. However, a race case can occur in which the target time has already passed. This causes more interrupts and more work even if the task is a simple one. It does produce less interrupts than its predecessors PIT and CMOS RTC giving it an edge. Despite its race condition, this modern timer is improvement upon old practices.  [http://hackipedia.org/Hardware/HPET,%20High%20Performance%20Event%20Timer/IA-PC%20HPET%20%28High%20Precision%20Event%20Timers%29%20Specification.pdf]
 ==Guest Timekeeping==