COMP 3000 Essay 2 2010 Question 10
mClock: Handling Throughput Variability for Hypervisor I/O Scheduling
Paper
mClock: Handling Throughput Variability for Hypervisor I/O Scheduling
Authors:
Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com
Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org
Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu
Background Concepts
Virtual machines (VMs) are becoming increasing significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.
Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.
Throughput is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput 6. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.
SFQ, or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
SFQ(D) (Start-time Fair Queuing)2 3 performed fairly well for low-intensity workloads. However, as the workload with VMs multiplied, the constant need for faster performance and efficiency rose. Hypervisors required a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed, to aid hypervisors.
mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Evidently, it is the better alternate to SFQ2 3 because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. The algorithm does not weaken the performance level as each VM gets added on, and mClock reservations are met. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.
What mClock is basically trying to achieve is to combine a constraint-based scheduler and a weight-based scheduler. Making sure the minimum IO reservation limit is consistently met, yet not over the upper bound limit, would be handled by the constraint-based scheduler. All thats left is that the weight-based scheduler distribute the remaining IO throughput to the rest of the VMs equally.
Research problem
Problem Facing
Today, we use a very primitive kind of I/O resource allocation in modern hypervisors. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) 1 is being used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts use a fair-scheduler called SFQ 2 3. What this means is that PARDA allocates I/O resources to VMs proportional to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem; whenever another VM is added or another background application is run on one of the VMs, all other VMs suffer a huge performance loss; a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, however as soon as the stress load on the shared storage device increases, the application might fail to run smoothly, or worse, crash.
Solution
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ3.
Contribution
This paper addresses the current limitations of I/O resource allocation for hypervisors. It has proposed a new and more efficient algorithm to allocate I/O resources. Older methods were limited solely by providing proportional shares, such as SFQ. mClock incorporates proportional shares, as well as a minimum reservation of I/O resources, and a maximum reservation.
Older methods of I/O resource allocation had a terrible performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods; for instance, SFQ.
The mClock algorithm uses a tag-based scheduler with some modifications; like the tag-based schedulers all I/O requests are assigned tags and scheduled in order of their tag values, the modifications includes the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. 2
mClock also uses both constraint-based and weight-based schedulers. Constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. 2
mClock was implemented on a modified version of VMware ESX server hypervisor 4 5. This modification only took around 200 of C code for the scheduling framework to use mClocks algorithms. This shows that it didn't take much to implement mClock effectively in a existing product, and improving its performance results. This seems like mClock can be very portable and easy to implement in other hypervisors as well.
Another contribution was the introduction of Distributed mClock or dmClock which basically runs an altered version of mClock at each server. dmClock is mainly used for cluster-based storage system which are rising as centralized disk arrays, and better than the alternates in terms of cost. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm which does not require complex synchronizations between servers.
Critique
The article introduces the mClock algorithm which handles (I/O resource allocation on) multiple VM in a variable throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. This algorithm, mClock, is able to meet those controls in varying capacity. The good thing about this is that the algorithm proves to be (more efficient compare to existing methods) efficient in clustered architectures due to better resource allocation while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one HOST without the constant worry of performance levels, with each VM add-on.
The paper proposes a better, and effective alternative to SFQ and other older methods; the mClock algorithm which efficiently handles multiple VMs in a throughput environment (LUN, PARDA).
One aspect of the writing style , "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" 2. The style of displaying these calculations depicts a messy, unorganized styled.<math></math>
References
1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.
2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMET- RICS, 2004. Interposed proportional sharing for a storage service utility
3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96- 02, UT Austin, January 1996.
4 VMware ESX Server User Manual, December 2007. VMware Inc.
5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.
6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.