COMP 3000 Essay 2 2010 Question 9

From Soma-notes
=Background Concepts=


Before we delve into the details of our research paper, it is essential that we provide some insight and background to the concepts and notions discussed by the authors.


===Virtualization===


In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation usually consists of a guest hypervisor and a virtualized environment, giving the guest virtual machine the illusion that it is running on the bare hardware. In reality, the host operating system treats the virtual machine as an application.
   
   
The term virtualization has become rather broad, associated with a number of areas where this technology is used: data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.


====Hypervisor====
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the possible issues that may arise due to the interaction of those guests among one another, and with the host hardware and operating system. The hypervisor also controls host resources without the host truly knowing which resources the VMM controls. [2]


====Nested virtualization====
Nested virtualization is the concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1).
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]
 


====Protection rings====
In modern operating systems, there are four levels of access privilege, called rings, that range from 0 to 3. Ring 0 (root mode) is the most privileged level, allowing access to the bare hardware components. The operating system kernel must execute in Ring 0 in order to access the hardware and secure control. User programs execute in Ring 3 (guest mode). Ring 1 and Ring 2 are dedicated to device drivers and other operations. [7]


In nested virtualization, the host hypervisor must execute at Ring 0 (root level) to access and emulate the hardware for the virtual machines, which run in guest mode.


====Para-virtualization====
Para-virtualization is a virtualization model that requires the guest OS kernel to be modified in order to allow some direct access to the host hardware. In contrast to the full-virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface implemented in the guest kernel that allows privileged hardware access via special instructions called hypercalls. The advantage is that there are fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system can be para-virtualized to be compatible with only one hypervisor. Another thing to note is that some operating systems, such as Windows, don't support para-virtualization. [3]


===x86 models of virtualization===

=====Trap and emulate model=====
The trap and emulate model is based on the idea that when a guest virtual machine attempts to execute privileged instructions, it triggers a trap or fault that drops down to level L0, where the host hypervisor resides. Since the host hypervisor is the only one capable of executing privileged instructions at Ring 0, it handles the trap caused by the guest and emulates the desired instruction on the guest's behalf. In this way, the guest gains Ring 0 privilege through the help of the hypervisor. It is important to note that the guest is unaware of this emulation and operates as if it were running on the bare hardware.

=====Multiple-level architecture=====
Every parent hypervisor handles every hypervisor running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as a trap handler for its guest VM at level Ln+1. This model is not supported by the x86 based systems discussed in our research paper.


=====Single-level architecture=====
The x86 based systems implement single-level virtualization support. In this hardware model, the host hypervisor L0 (running in Ring 0) handles all traps caused by any guest hypervisor at any level of the virtualization stack. Assume that the host hypervisor (L0) runs L1. When L1 attempts to run its own virtual machine L2, this causes a trap that drops down to L0; L0 then handles the trap and performs the emulation required for L1 to create L2. More generally, every trap occurring at level Ln drops to the L0 level, where the host hypervisor resides. The host hypervisor then forwards the trap to the parent of Ln, which is Ln-1, whose handling in turn causes a trap that drops to L0 again, and so on. These trap-handling switches keep occurring until the desired emulated result reaches Ln, allowing it to execute (Figure 2).
 
[[File:single-levelV.png|thumb|right|400px|Figure 2: The single-level architecture support for virtualization, which relies on the host hypervisor (L0) to handle every trap caused by a guest.]]
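To make the single-level control flow concrete, here is a small Python sketch (our own illustration, not from the paper) that traces the control switches for one trap raised at level Ln: the trap always drops to L0 first, L0 resumes the trapping guest's parent, and the parent's own privileged handling traps in turn.

```python
def handle_trap(level, path=None):
    """Trace the control switches for one trap raised at L<level>
    under the single-level (x86) model: every trap drops to L0,
    which then resumes the trapping guest's parent to handle it."""
    if path is None:
        path = []
    path.append((level, 0))        # trap drops from L<level> to L0
    if level <= 1:
        return path                # L0 is L1's parent: it handles the trap
    path.append((0, level - 1))    # L0 forwards to the parent L<level-1>
    return handle_trap(level - 1, path)  # the parent's handling traps too

# One trap in L3 bounces through five control switches:
print(handle_trap(3))   # [(3, 0), (0, 2), (2, 0), (0, 1), (1, 0)]
```

Each extra level of nesting adds two more switches to the chain, which is why deep nesting is so costly without further optimization.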


===The uses of nested virtualization===

====Compatibility====
A user can run a particular application or OS that is not compatible with the existing or running OS as a virtual machine. Operating systems could also provide the user a compatibility mode of other operating systems or applications. An example of this is the Windows XP mode that is available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.


====Cloud computing====
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider's hardware. This way both sides benefit: the provider can attract customers, and customers have the freedom of implementing their systems on the host hardware without worrying about compatibility issues. [9]


The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their APIs and databases on Amazon's hardware. [16]


====Security====
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is a hollow program or network that appears to outside users to be functioning, but in reality is only there as a security tool to watch or trap attackers. By using nested virtualization, we can create a honeypot of our system as a virtual machine and see how our virtual system is attacked or what kinds of features are exploited. We can take advantage of the fact that those virtual honeypots can easily be controlled, manipulated, destroyed or even restored. [15]


====Migration/Transfer of VMs====
Nested virtualization can also be used in live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade. Instead of moving each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable.
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents. [8]


====Testing====
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since a snapshot of the running virtual machine can be created and restored.
 
 


=Research problem=


Nested virtualization has been studied since the mid 1970s (see paper citations 21, 22 and 36). Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions (see citation 12); however, these solutions suffer from significant performance problems.


The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between the different levels of hypervisors increases. A trap in a highly nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until, in the worst case, it reaches the hypervisor one level below the virtual machine itself. The trap instruction can be bounced between different levels of hypervisor, which results in one trap instruction multiplying into many trap instructions.
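A toy Python model of this multiplication (the per-handler instruction count `m` is invented for illustration; the paper reports measured exit counts instead): if handling one trap requires the parent hypervisor to execute m privileged instructions, each of which traps in turn, the number of exits to L0 grows multiplicatively with nesting depth.

```python
def exits_to_l0(level, m=10):
    """Count exits to L0 caused by one trap in a guest at L<level>,
    assuming (hypothetically) that handling one trap requires the
    parent hypervisor to execute m privileged instructions, each of
    which is itself a trap."""
    if level == 1:
        return 1      # a trap in L1 is handled by L0 directly
    # The trap exits to L0, which forwards it to L(level-1); that
    # handler's m privileged instructions each repeat the process.
    return 1 + m * exits_to_l0(level - 1, m)

for n in (1, 2, 3):
    print(f"one trap at L{n} -> {exits_to_l0(n)} exits to L0")
# one trap at L1 -> 1 exits to L0
# one trap at L2 -> 11 exits to L0
# one trap at L3 -> 111 exits to L0
```

The exact numbers depend entirely on the assumed m, but the shape of the growth is the point: each added level multiplies the exit count.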


Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful because this support does not always exist, such as on x86 processors. Solutions that do not require this suffer from significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique to reconcile the lack of hardware support on available hardware with efficiency. It is, for the most part, able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.


More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors acknowledge the CPU, memory, and IO devices as the three key resources that they need to share. Combining this, the paper presents a solution to the problem of how to multiplex the CPU, memory, and IO efficiently between multiple virtual operating systems and hypervisors on a system which has no architectural support for nested virtualization.


=Contribution=
The non-stop evolution of computers invites intricate designs that are virtualized and harmonious with cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with '''their''' choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS's resources, enable programmers to build further developments and ideas on this infrastructure. For example, the paper Accountable Virtual Machines wraps programs around a particular VM state, which could certainly be placed on a separate hypervisor for ideal isolation.
 
==Theory==
The fundamental idea of the Turtles Project relies on multiplexing the hardware among the nested virtual machines involved. When a virtual machine like L1 attempts to run L2, this triggers a trap that gets handled by L0, just as we illustrated earlier in the single-level architecture model. This trap includes the environment specifications needed to run L2 on the bare hardware. L0 converts L2's virtual memory to L1's virtual memory so that the two run at the same level. In this way, we flatten the virtualization levels and run L2 as an L1 virtual machine. We can generalize this scheme and keep running every new virtual machine at the L1 level, ending up with a flattened architecture that brings performance closer to the hardware level.
 
However, one question springs to mind: if the host hypervisor ends up running and multiplexing the hardware among the guest virtual machines, how can we keep track of the virtualization levels? How can we know which virtual machine runs on top of which? The answer lies within the host hypervisor. By using special control structures such as the VMCS (Intel) and the VMCB (AMD), the hypervisor is able to differentiate between the levels and keep track of the different virtual environments and their associated guests.


==CPU Virtualization==
We illustrate the CPU virtualization implemented by the paper using two-level nested virtualization (L0->L1->L2).


Assume that L0 (the main host hypervisor) runs L1 as a virtual machine. L0 uses VMCS0->1 to run L1.
The VMCS (virtual machine control structure) is the fundamental data structure the hypervisor uses to keep track of the virtual machine environment. The VMCS describes the virtual machine specifications that are passed to the CPU for execution. L1 (also a hypervisor) prepares VMCS1->2 to run its own virtual machine, L2. To start L2, L1 executes the instruction vmlaunch, which is a special instruction for starting a new virtual machine.


The vmlaunch instruction will trap, and L0 will have to handle the trap, because L1 is running as a virtual machine and we are implementing the single-level architecture of virtualization discussed earlier, where every trap drops to the host hypervisor at L0. L0 cannot execute L2 directly using VMCS1->2 because that environment is not valid within L0's scope. So, in order for L0 to run L2 and multiplex the hardware between L1 and L2, L0 merges VMCS0->1 and VMCS1->2 to create VMCS0->2, a new environment specification that enables L0 to run L2 directly.
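As a rough sketch of the merge (the field names below are invented for illustration; real VMCS fields are defined by Intel's VMX specification), VMCS0->2 can be pictured as taking the host-state half of VMCS0->1 and the guest-state half of VMCS1->2:

```python
# Hypothetical sketch of how L0 merges VMCS0->1 and VMCS1->2 into
# VMCS0->2. Field names and values are invented for illustration.
vmcs_0_1 = {
    "host_state":  {"rip": "l0_exit_handler", "cr3": "l0_page_table"},
    "guest_state": {"rip": "l1_entry",        "cr3": "l1_page_table"},
}
vmcs_1_2 = {
    "host_state":  {"rip": "l1_exit_handler", "cr3": "l1_page_table"},
    "guest_state": {"rip": "l2_entry",        "cr3": "l2_page_table"},
}

def merge_vmcs(outer, inner):
    """L0 runs L2 directly: the merged VMCS keeps L0's host state
    (so every exit still drops to L0) and L2's guest state (so L2's
    code actually runs on the hardware)."""
    return {"host_state": dict(outer["host_state"]),
            "guest_state": dict(inner["guest_state"])}

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest_state"]["rip"])  # l2_entry: L2 runs directly on the CPU
print(vmcs_0_2["host_state"]["rip"])   # l0_exit_handler: exits still go to L0
```

The real merge is more involved (control fields must be combined, not just copied), but the host-state/guest-state split captures the flattening idea.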


L0 will now launch L2, which can cause it to trap. L0 either handles the trap itself or forwards it to L1, depending on whether it is L1's responsibility to handle the trap. The way L1 handles a single L2 exit is by reading and writing the VMCS, disabling interrupts, and updating the environment specifications. Normally this would not be a problem, but since L1 is running in guest mode as a virtual machine, all the traps involved in handling a single high-level L2 exit or L3 exit can accumulate into many exits, which can substantially lower performance. The paper addresses this by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously.


==Memory virtualization==


To virtualize memory, the paper introduces a new technique called multi-dimensional paging as an alternative to shadow paging.
 
Shadow paging is a technique used in memory virtualization to map the guest virtual machine's page tables to the physical hardware (the pages that were allocated to the guest by the host hypervisor at the hardware level). This operation is expensive, since every translation through the shadow page table to a physical address requires a significant amount of CPU and memory resources, and the hypervisor must update the physical addresses every time their shadow equivalents change at the virtual machine level. [17]
 
Recently, x86 based systems have added two-dimensional page tables, where a second table called the EPT is used to map guest physical addresses to host physical addresses, improving performance. The EPT tables rarely change, whereas the guest page tables change frequently. But even with two-dimensional pages, there are obvious issues when it comes to nested virtualization.
 
The concept of multi-dimensional page tables extends the two-dimensional page tables. What the authors propose is that when a host hypervisor (L0) runs L1, it exposes an emulated EPT to L1, creating a memory environment called EPT0->1, in a similar fashion to what we showed in CPU virtualization. We can extend this model and let L1 create its own EPT1->2 with its own virtual machine L2. So when L0 comes to handle L2's page tables, it can use EPT0->1 and EPT1->2 to create EPT0->2, which is used to translate L2's memory pages. In this way, we reduce the number of exits and improve the performance of nested memory virtualization.
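At its core, building EPT0->2 is a composition of two address mappings. A toy Python sketch of that composition, with made-up page numbers:

```python
# Toy sketch: composing two page mappings into one, the way L0
# builds EPT0->2 from EPT1->2 and EPT0->1. Page numbers are invented.
ept_1_2 = {0: 7, 1: 3, 2: 9}     # L2 guest-physical -> L1 guest-physical
ept_0_1 = {3: 40, 7: 41, 9: 42}  # L1 guest-physical -> L0 host-physical

def compose(inner, outer):
    """EPT0->2[page] = EPT0->1[EPT1->2[page]]: translate an L2 page
    through L1's table, then through L0's table, and keep the final
    mapping so the hardware can walk it in one step."""
    return {page: outer[mid] for page, mid in inner.items() if mid in outer}

ept_0_2 = compose(ept_1_2, ept_0_1)
print(ept_0_2)   # {0: 41, 1: 40, 2: 42}
```

Once EPT0->2 is in place, an L2 memory access needs no intervention from L1 at all, which is where the exit reduction comes from.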


==I/O virtualization==


There are three fundamental ways for a virtual machine to access I/O:

* Device emulation using the host hypervisor. [10]
* Para-virtualized drivers that require modification of the guest kernel. [11][12]
* Direct device assignment, which typically results in the best performance. [13][14]
 
 
The paper argues that, in a two level nested-virtualization (L0->L1->L2), we can theoretically mix and combine any two of those 3 approaches to implement I/O virtualization. For instance, we can implement device simulation at the L0->L1 level, but implement device assignment approach for the level L1->L2. However, the authors chose to focus their attention on direct device assignment, since ideally it produces best levels of performance.
 
To get the best performance, they used a IOMMU (Input/Output memory management unit) for safe DMA (Device management assignemnt) bypass. With nested 3X3 options for I/O virtualization, they had the many options but they used multi-level device assignment giving L2 guest direct access to L0 devices bypassing both L0 and L1.  
To do this, they had to memory map I/O with program I/0 with DMA with interrupts. The idea with DMA is that of each hypervisor L0, L1 needs to use a IOMMU to allow its virtual machine to access the device in order to bypass safety. There is only one plate for IOMMU so L0 needs to emulate an IOMMU.  
L0 will then compress the Multiple IOMMU into a single hardware IOMMU page table so that L2 programs the device directly. The device DMA's are stored into the L2 memory space directly.


==Macro optimizations==
==Macro optimizations==
How they implement the Micro-Optimizations to make it go faster on the turtle project? The two main places where guest of a nested hypervisor is slower then the same guest running on a baremetal hypervisor are the second transition between L1 to L2 and the exit handling code running on the L1 hypervirsor. Since L1 and L2 are assumed to be unmodified required charges were found in L0 only. They Optimized the transitions between L1 and L2. This involves an exit to L0 and then an entry. In L0 the most time is spent in merging VMC's, so they optimize this by copying data between VMC's if its being modified and they carefully balance full copying versus partial copying and tracking. The VMCS are optimized further by copying multiple VMCS fields at once. Normally by intel's specification read or writes must be performed using vmread and vmwrite instruction (operate on a single field). VMCS data can be accessed without ill side-effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies. (might not work on processors other than the ones they have tested).The main cause of this slowdown exit handling are additional exits caused by privileged instructions in the exit-handling code. vmread and vmwrite are used by the hypervisor to change the guest and host specification ( causing L1 exit multiple times while it handles a single l2 exit). By using AMD SVM the guest and host specifications can be read or written to directly using ordinary memory loads and stores( L0 does not intervene while L1 modifies L2 specifications).
The testing achieved by the reserach shows thst the two main places where a guest of a nested hypervisor is slower than the same guest running on a bare metal hypervisor occur at the second transition between L1 to L2, and the exit handling code running on the L1 hypervirsor. Since L1 and L2 are assumed to be unmodified (not involving any binary translations), the required changes were found in L0 only.  
 
They optimized the transitions between L1 and L2. This involves an exit to L0 and then an entry. In L0, the most time is spent in merging VMC's, so they optimize this by copying data between VMC's if it is being modified.  They carefully balance full copying versus partial copying and tracking. The VMCs are optimized further by copying multiple VMC fields at once. Normally, by intel's specification read or writes must be performed using the vmread and vmwrite instruction (operate on a single field). VMC's data can be accessed without the ill side-effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies. This may not work on processors other than the ones that were used in testing. The main cause of this slowdown exit handling are additional exits caused by privileged instructions in the exit-handling code. Vmread and Vmwrite are used by the hypervisor to change the guest and host specification (causing a L1 exit multiple times while it handles a single L2 exit). By using AMD SVM, the guest and host specifications can be read or written to directly using ordinary memory loads and stores( L0 does not intervene while L1 modifies L2 specifications).


=Critique=


===Pros===


The paper makes a strong contribution to the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers, and should have little noticeable effect on an end user running an application in a nested virtual machine, especially if the user is working at a shallow nesting depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing operating systems that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It then follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is also visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor.


Since this is the first successful implementation of this type that does not modify the hardware (earlier attempts existed only as research designs), we expect to see increased interest in the nested virtualization model described above. The framework is also convenient for testing and debugging, since hypervisors can run inconspicuously beneath other nested hypervisors and VMs without being detected. Moreover, the performance overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.


===Cons ===


The main drawback is the efficiency cost that appears as the authors introduce an additional level of abstraction. The everlasting memory/efficiency trade-off continues as nested virtualization keeps garnering attention from the computing world. The performance hit is mainly imposed by the multiplication of exits: when a nested guest traps, control passes to the lowest-level hypervisor, which may hand the trap off to the hypervisors above it before finally returning to the guest. Time complexity could therefore become an issue when many levels of virtualization are involved, and the paper does not demonstrate that the design is capable of handling much deeper levels of nesting.


Furthermore, we observed that the paper only performs tests at the L2 level, i.e. a guest with two hypervisors below it. It would have been useful for understanding the limits of nesting if they had investigated deeper levels, such as L4 or L5. It is difficult to predict how the system would react to deep nesting, because the growth in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that some optimizations, such as the avoidance of vmread/vmwrite operations, are aimed at specific CPUs, as stated on page 7, section 3.5: "(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested". This means that some of the techniques the authors use to increase performance are not reproducible on other systems, so the generality of parts of their solution may be limited.
 
===Style and Presentation ===
 
The paper presents an elaborate description of the concept of nested virtualization and does an excellent job of conveying the technical details. It does, however, assume a high level of background knowledge and familiarity with the subject, especially with the more technical points of the hardware architecture used to implement virtualization. For example, section 4.1.2, "Impact of Multi-dimensional Paging", illustrates the technique with an example using terms such as EPT and L1, which may not be familiar to readers unused to the technical language. The paper does, on the other hand, touch on a wide range of topics in the field of virtualization, including CPU, memory and I/O device virtualization. This wide scope means that many of the major components of virtualization are discussed, so in the process of understanding the paper one learns a great deal about many different parts of the field.


=== Conclusion ===
 
The research presented in the Turtles Project is the first to achieve a significant level of efficiency in x86 nested virtualization without altering the hardware, relying instead on software-only techniques and mechanisms. This is a major improvement over the previously available solutions. The paper has garnered considerable attention and praise from the computing world, including the Jay Lepreau best paper award. More recently, a paper on nested virtualization that examines trap forwarding has been published by the VMware research lab [18]; it references and uses some of the data documented in the Turtles Project.


=References=


[3] Tanenbaum, Andrew (2007). ''Modern Operating Systems (3rd edition)'', pages 574-576.
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In ''Proceedings of the Workshop on Virtual Computer Systems'', ACM, pp. 74-112.
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master's Thesis, University of Antwerp, 2010.
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.
[7] Buytaert, Dittner & Rule. ''The best damn server virtualization book period.'' pages 16-18
[8] Clark, Fraser, Hand, Hansen, Jul, Limpach, Pratt & Warfield. ''Live migration of virtual machines''. page 273-286
[9] Gillam, Lee (2010). ''Cloud Computing: Principles, Systems and Applications''. page 26-27
[10] Sugerman, Venkitachalam & Lim (2001). [http://portal.acm.org/citation.cfm?id=1618525.1618534 ''Virtualizing I/O devices on VMware workstation’s hosted virtual machine monitor.'']
[11] Russell, Rusty (2008). [http://portal.acm.org/citation.cfm?id=1400097.1400108&coll=GUIDE&dl=GUIDE&idx=J597&part=newsletter&WantType=Newsletters&title=ACM%20SIGOPS%20Operating%20Systems%20Review ''virtio: towards a de-facto standard for virtual I/O devices'']
[12] Barham, Dragovic, Fraser, Hand, Harris, Ho, Neugebauer, Pratt & Warfield (2003). [http://portal.acm.org/citation.cfm?doid=945445.945462 ''Xen and the art of virtualization'']
[13] LeVasseur, Uhlig, Stoess & Gotz (2004). [http://portal.acm.org/citation.cfm?id=1251256 ''Unmodified device driver reuse and improved system dependability via virtual machines''.]
[14] Yassour, Ben-Yehuda & Wasserman (2008). [http://www.mulix.org/misc/hv.pdf ''Direct device assignment for untrusted fully-virtualized virtual machines'']
[15] ''Proceedings of the 5th European Conference on Information Warfare and Security : National Defence College, Helsinki, Finland, 1 - 2 June 2006''. p 245-246
[16] [http://aws.amazon.com/ Amazon Web Services]
[17] Tanenbaum, Andrew (2007). ''Modern Operating Systems (3rd edition)'', page 576-577
[18] Poon, W.-C. and Mok, A. K. [http://www.vmware.com.mx/files/pdf/partners/academic/hh.pdf ''Bounding the running time of interrupt and exception forwarding in recursive virtualization for the x86 architecture'']

Latest revision as of 19:45, 3 December 2010



Paper

"The Turtles Project: Design and Implementation of Nested Virtualization"

Authors:

  • Muli Ben-Yehuda +
  • Michael D. Day ++
  • Zvi Dubitzky +
  • Michael Factor +
  • Nadav Har’El +
  • Abel Gordon +
  • Anthony Liguori ++
  • Orit Wasserman +
  • Ben-Ami Yassour +

Research labs:

+ IBM Research – Haifa

++ IBM Linux Technology Center


Website: http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Video presentation: http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]


Background Concepts

Before we delve into the details of our research paper, it is essential that we provide some insight and background to the concepts and notions discussed by the authors.

Virtualization

In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation usually consists of a guest hypervisor and a virtualized environment, giving the guest virtual machine the illusion that it is running on the bare hardware. In reality, however, the host operating system treats the virtual machine as an application.

The term virtualization has become rather broad, associated with a number of areas where this technology is used: data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.

Hypervisor

Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines, and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. The hypervisor also has control of host resources without the host truly knowing which resources the VMM controls. [2]

Nested virtualization

Nested virtualization is the concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1).

Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.

Protection rings

In modern operating systems, there are four levels of access privilege, called rings, that range from 0 to 3. Ring 0 (root mode) is the most privileged level, allowing access to the bare hardware components. The operating system kernel must execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3 (guest mode). Rings 1 and 2 are dedicated to device drivers and other operations. [7]

In nested virtualization, the host hypervisor must execute at Ring 0 (root level) to access and emulate the hardware for the virtual machines which run in guest mode.

Para-virtualization

Para-virtualization is a virtualization model that requires the guest OS kernel to be modified in order to allow the guest some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface implemented in the guest kernel that allows privileged hardware access via special calls named hypercalls. The advantage is that there are fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, do not support para-virtualization. [3]

x86 models of virtualization

Trap and emulate model

The trap and emulate model is based on the idea that when a guest virtual machine attempts to execute a privileged instruction, it triggers a trap or fault that drops down to level L0, where the host hypervisor resides. Since the host hypervisor is the only one capable of executing privileged instructions at Ring 0, it handles the trap caused by the guest and provides an emulation of the desired instruction. This way, the guest gains Ring 0 privilege through the help of the hypervisor. It is important to note that the guest is unaware of this emulation and operates as if it were running on the bare hardware.
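The trap-and-emulate loop can be sketched in a few lines. This is a toy model only: the instruction names and virtual CPU fields below are invented for illustration, not taken from the paper or any real ISA.

```python
# Toy sketch of trap-and-emulate: the guest runs normally until it hits
# a privileged instruction; that instruction faults to the hypervisor,
# which emulates its effect on the guest's *virtual* CPU state and then
# resumes the guest.

PRIVILEGED = {"cli"}  # e.g. "disable interrupts" (illustrative)

def emulate(op, vcpu):
    """Hypervisor-side handler: apply the instruction's effect to the
    guest's virtual CPU state instead of the real hardware."""
    if op == "cli":
        vcpu["virtual_if"] = False  # only the guest's interrupt flag changes
    vcpu["trace"].append(f"trapped+emulated:{op}")

def run_guest(program, vcpu):
    for op in program:
        if op in PRIVILEGED:
            emulate(op, vcpu)       # hardware would fault down to L0 here
        else:
            vcpu["trace"].append(f"ran:{op}")

vcpu = {"virtual_if": True, "trace": []}
run_guest(["mov", "cli", "add"], vcpu)
print(vcpu["trace"])  # ['ran:mov', 'trapped+emulated:cli', 'ran:add']
```

The guest never notices the difference: from its point of view, `cli` simply executed, even though only its virtual interrupt flag changed while the host's real interrupts stayed enabled.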

Single-level architecture

x86-based systems provide single-level architectural support for virtualization. In this hardware model, the host hypervisor L0 (running in Ring 0) handles all traps caused by any guest hypervisor running at any level of the virtualization stack. Assume that the host hypervisor (L0) runs L1. When L1 attempts to run its own virtual machine, L2, this causes a trap that drops down to the host hypervisor at level L0; L0 then handles the trap and performs the emulation required for L1 to create L2. More generally, every trap occurring at level Ln causes a drop to L0, where the host hypervisor resides. The host hypervisor then forwards the trap to the parent of Ln, namely Ln-1, which in turn causes further traps down to L0, and so on. These trap-handling switches keep occurring until the desired emulated result reaches Ln, allowing it the privilege to execute (Figure 2).

Figure 2: The single-level architecture support for virtualization that relies on the host hypervisor (L0) to handle every trap caused by a guest.
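The forwarding chain of the single-level rule can be modelled in a few lines. This is purely illustrative (the paper gives no such formula); it only shows the order in which levels are visited while one trap is handled.

```python
# Toy model of the single-level rule: a trap raised in a guest at level
# Ln always lands at L0 first, and L0 then forwards it up through
# L1, ..., Ln-1 before the trapped guest at Ln is finally resumed.
# Every arrow printed below corresponds to a real world switch.

def forwarding_path(n):
    """Return the sequence of levels visited to handle one trap at Ln."""
    path = [0]                  # the hardware always exits to L0 first
    path += list(range(1, n))   # L0 forwards through L1 ... Ln-1
    path.append(n)              # the trapped guest at Ln is resumed
    return path

for n in (1, 2, 3):
    print(f"L{n} trap: " + " -> ".join(f"L{k}" for k in forwarding_path(n)))
```

Notice how the path grows linearly with the level at which the trap is raised; the next section shows why the *total* number of exits grows much faster than that.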

The uses of nested virtualization

Compatibility

A user can run a particular application or OS that is not compatible with the existing or running OS as a virtual machine. Operating systems could also provide the user a compatibility mode of other operating systems or applications. An example of this is the Windows XP mode that is available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.

Cloud computing

A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider's hardware. This way both sides benefit: the provider can attract customers, and customers have the freedom of implementing their systems on the host hardware without worrying about compatibility issues. [9]

The most well known example of an IAAS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their API and databases on Amazon's hardware. [16]

Security

We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is a hollow program or network that appears functional to outside users but in reality exists only as a security tool to watch or trap attackers. Using nested virtualization, we can create a honeypot of our system as a virtual machine and observe how the virtual system is attacked or what kinds of features are exploited. We can take advantage of the fact that these virtual honeypots can easily be controlled, manipulated, destroyed or even restored. [15]

Migration/Transfer of VMs

Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade. Instead of moving each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable. In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents. [8]

Testing

Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or restored, since a snapshot of a running virtual machine can be taken and restored later.

Research problem

Nested virtualization has been studied since the mid 1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. Software-based solutions have recently become available [5]; however, these solutions suffer from significant performance problems.

The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, so does the number of control switches between the different levels of hypervisors. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so a single trap instruction multiplies into many trap instructions.
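A back-of-the-envelope model makes the multiplication concrete. The constant `m` below (privileged instructions per exit handler) is invented for illustration; the paper does not give this formula, only the qualitative observation that exits multiply with depth.

```python
# Toy model of trap multiplication: if a hypervisor's exit handler
# itself executes m privileged instructions, and each of those re-traps
# one level down, the number of hardware-level exits caused by a single
# guest trap grows geometrically with nesting depth.

def total_exits(depth, m=3):
    """Hardware-level exits caused by one trap in a guest at `depth`."""
    if depth <= 1:
        return 1                        # L0 handles an L1 exit directly
    return m * total_exits(depth - 1)   # each handler op re-traps below

for d in (1, 2, 3, 4):
    print(f"depth {d}: {total_exits(d)} exits")
```

Even with a small `m`, the count explodes quickly, which is exactly why the paper's software techniques concentrate on collapsing this chain rather than faithfully forwarding every trap level by level.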

Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful because this support does not always exist, such as on x86 processors. Solutions that do not require this suffer from significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique to reconcile the lack of hardware support on available hardware with efficiency. It is, for the most part, able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depth the authors considered, which allows efficient virtualization without architectural support.

More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors acknowledge the CPU, memory, and IO devices as the three key resources that they need to share. Combining this, the paper presents a solution to the problem of how to multiplex the CPU, memory, and IO efficiently between multiple virtual operating systems and hypervisors on a system which has no architectural support for nested virtualization.

Contribution

The non-stop evolution of computing encourages intricate designs that are virtualized and work in harmony with cloud computing. The paper contributes to this trend by allowing consumers and users to load machines with their choice of hypervisor/OS combination, which provides grounds for both security and compatibility. The sophisticated abstractions presented in the paper, such as shadow paging and the isolation of a single OS's resources, give programmers a basis for further development and ideas that build on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs inside a VM with a particular state, which could certainly be placed on a separate hypervisor for ideal isolation.

Theory

The fundamental idea of the Turtles Project relies on multiplexing the hardware among the involved nested virtual machines. When a virtual machine such as L1 attempts to run L2, this triggers a trap that gets handled by L0, just as we illustrated earlier for the single-level architecture model. This trap includes the environment specifications needed to run L2 on the bare hardware. L0 maps L2's memory into L1's memory so that the two can run at the same level. In this way, we flatten the virtualization levels and run L2 as an L1-level virtual machine. We can generalize this scheme and keep running every new virtual machine at the L1 level, ending up with a flattened architecture that brings performance closer to the hardware level.

However, one question springs to mind: if the host hypervisor ends up running and multiplexing the hardware among all the guest virtual machines, then how can we keep track of the virtualization levels? How can we know which virtual machine runs on top of which? The answer lies within the host hypervisor. By using special control structures, the VMCS (Intel) and the VMCB (AMD), the hypervisor can differentiate between the levels and keep track of the different virtual environments and their associated guests.

CPU Virtualization

We illustrate the CPU virtualization implemented by the paper by using a two level nested virtualization (L0->L1->L2).

Assume that L0 (the main host hypervisor) runs L1 as a virtual machine. L0 uses VMCS0->1 to run L1. The VMCS is the fundamental data structure the hypervisor uses to keep track of a virtual machine's environment; it describes the virtual machine specifications that are passed to the CPU for execution. L1 (also a hypervisor) prepares VMCS1->2 to run its own virtual machine, L2, and then executes vmlaunch, a special instruction for starting a new virtual machine.

The vmlaunch instruction will trap, and L0 has to handle the trap, because L1 is running as a virtual machine under the single-level architecture discussed earlier, where every trap drops to the host hypervisor at L0. L0 cannot execute L2 directly using VMCS1->2, because that environment is not valid within L0's scope. So in order to run L2 and multiplex the hardware between L1 and L2, L0 uses VMCS0->1 and VMCS1->2 to create VMCS0->2, a new environment specification that enables L0 to run L2 directly.

L0 will now launch L2, which may in turn trap. L0 either handles the trap itself or forwards it to L1, depending on whether the trap is L1's responsibility. L1 handles a single L2 exit by reading and writing the VMCS, disabling interrupts, and updating the environments. Normally this would not be a problem, but since L1 is itself running in guest mode as a virtual machine, all the traps involved in handling a single L2 (or L3) exit can accumulate into many exit calls, which can substantially lower performance.
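The VMCS0->2 construction in this section can be sketched as a merge of two structures. The field names below are invented (real VMCS layouts are far richer), but the merge rules follow the scheme just described: guest state comes from what L1 prepared for L2, host state must point back at L0 so that exits land in L0, and trap controls are the union of what either hypervisor asked to intercept.

```python
# Toy sketch of merging VMCS0->1 and VMCS1->2 into VMCS0->2.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Build VMCS0->2 out of VMCS0->1 and VMCS1->2."""
    return {
        "guest_state": vmcs_1_2["guest_state"],  # L2's CPU state, as L1 set it up
        "host_state": vmcs_0_1["host_state"],    # exits must land in L0, not L1
        # trap whenever EITHER hypervisor asked to intercept the event
        "controls": vmcs_0_1["controls"] | vmcs_1_2["controls"],
    }

vmcs_0_1 = {"guest_state": "L1-regs", "host_state": "L0-entry",
            "controls": {"exit_on_io"}}
vmcs_1_2 = {"guest_state": "L2-regs", "host_state": "L1-entry",
            "controls": {"exit_on_cpuid"}}

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest_state"], vmcs_0_2["host_state"])  # L2-regs L0-entry
```

The important design point visible even in this sketch is that VMCS1->2's host state ("L1-entry") never reaches the hardware: L0 substitutes its own, which is precisely how it keeps control while still letting L2 run directly on the CPU.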

Memory virtualization

To virtualize memory, the paper introduces a new technique called multi-dimensional paging as an alternative to shadow pages.

Shadow paging is a technique used in memory virtualization to map the guest virtual machine's page tables to the physical hardware (the pages allocated to the guest by the host hypervisor). This operation is expensive, since every request through the shadow page table to a physical address requires a significant amount of CPU and memory resources. Moreover, the hypervisor must update the physical addresses every time their shadow equivalents change at the virtual machine level. [17]
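The composition behind shadow paging can be shown with two small dictionaries standing in for page tables (all addresses are invented). The expensive part the text mentions is visible at the end: every guest page-table write invalidates the composition, so the hypervisor must trap the write and rebuild the affected shadow entries.

```python
# Toy illustration of shadow paging: the guest maintains virtual ->
# guest-physical mappings, the hypervisor maintains guest-physical ->
# host-physical mappings, and the shadow table actually walked by the
# hardware is the composition of the two.

guest_pt = {0x1000: 0x5000, 0x2000: 0x6000}  # guest-virtual -> guest-physical
hyp_map = {0x5000: 0x9000, 0x6000: 0xA000}   # guest-physical -> host-physical

def build_shadow(guest_pt, hyp_map):
    """Compose the two levels into a guest-virtual -> host-physical table."""
    return {gva: hyp_map[gpa] for gva, gpa in guest_pt.items()}

shadow = build_shadow(guest_pt, hyp_map)
print(hex(shadow[0x1000]))  # 0x9000

# The guest remaps a page; the hypervisor must catch this write and
# rebuild the stale shadow entry, which is where the overhead comes from.
guest_pt[0x1000] = 0x6000
shadow = build_shadow(guest_pt, hyp_map)
print(hex(shadow[0x1000]))  # 0xa000
```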

Recently, x86-based systems have added two-dimensional page tables, where a second table called the EPT (Extended Page Table) is used to map guest physical addresses to host physical addresses, improving performance. But even with two-dimensional pages, there are obvious issues when it comes to nested virtualization.

The concept of multi-dimensional page tables builds directly on two-dimensional page tables. The authors propose that when the host hypervisor (L0) runs L1, it exposes its one and only EPT to L1, creating a memory environment called EPT0->1, in a similar fashion to what we showed for CPU virtualization. We can extend this model and let L1 create its own EPT1->2 for its virtual machine L2. When L0 comes to handle L2's page tables, it can use EPT0->1 and EPT1->2 to create EPT0->2, which is used to translate L2's memory pages. This way, we reduce the number of exit calls and improve the performance of nested memory virtualization.
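The EPT0->2 construction is, at heart, a composition of two address mappings, which a short sketch can show (all addresses below are invented). EPT1->2 maps L2-physical to L1-physical addresses, EPT0->1 maps L1-physical to host-physical addresses, and L0 folds them together so that the hardware can translate L2's addresses on the fast path without exiting to L0.

```python
# Toy sketch of multi-dimensional paging: compose EPT0->1 and EPT1->2
# into the EPT0->2 table that the hardware actually walks for L2.

ept_0_1 = {0x100: 0xA00, 0x200: 0xB00}  # L1-physical -> host-physical
ept_1_2 = {0x10: 0x100, 0x20: 0x200}    # L2-physical -> L1-physical

def compose_ept(ept_0_1, ept_1_2):
    """Build EPT0->2: L2-physical -> host-physical."""
    return {l2: ept_0_1[l1] for l2, l1 in ept_1_2.items()}

ept_0_2 = compose_ept(ept_0_1, ept_1_2)
print({hex(k): hex(v) for k, v in ept_0_2.items()})
```

This is the same composition trick as shadow paging, but applied one level lower: the table being composed maps physical addresses rather than guest-virtual ones, so ordinary guest page-table updates no longer force an exit.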

I/O virtualization

There are three fundamental ways for a virtual machine to access I/O:

  • Device emulation by the host hypervisor. [10]
  • Para-virtualized drivers, which require modification of the guest kernel. [11][12]
  • Direct device assignment, which typically results in the best performance. [13][14]


The paper argues that, in two-level nested virtualization (L0->L1->L2), we can theoretically combine any of these three approaches to implement I/O virtualization. For instance, we can implement device emulation at the L0->L1 level but device assignment at the L1->L2 level. However, the authors chose to focus their attention on direct device assignment, since it ideally produces the best performance.

To get the best performance, they used an IOMMU (Input/Output Memory Management Unit) for safe DMA (Direct Memory Access) bypass. With two levels of nesting there are 3x3 options for I/O virtualization; of these, they chose multi-level device assignment, giving the L2 guest direct access to L0's devices and bypassing both L0 and L1. To do this, they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. The idea with DMA is that each hypervisor (L0 and L1) needs an IOMMU to let its virtual machine access the device safely, but there is only one hardware IOMMU, so L0 must emulate an IOMMU for L1. L0 then compresses the multiple IOMMU page tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device's DMAs land directly in L2's memory space.
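The IOMMU compression step is structurally the same composition as multi-dimensional paging, just applied to DMA addresses. A minimal sketch, with illustrative values: L1 programs its emulated IOMMU with L2 guest-physical addresses, and L0 folds that table through its own mapping to build the one hardware IOMMU table.

```python
# Sketch of multi-level device assignment: collapse L1's (emulated)
# IOMMU table and L0's own mapping into the single hardware IOMMU
# table, so DMA programmed by L2 lands directly in L2's memory.

l1_iommu = {0x40: 0x21}   # L2 guest-physical -> L1 physical (emulated IOMMU, set up by L1)
l0_iommu = {0x21: 0xA5}   # L1 physical -> host physical (L0's own mapping)

def compress_iommu(l1_table, l0_table):
    """Fold both levels into the one hardware IOMMU table."""
    return {dma_addr: l0_table[l1_pa] for dma_addr, l1_pa in l1_table.items()}

hw_iommu = compress_iommu(l1_iommu, l0_iommu)
# The device DMAs to address 0x40; the hardware IOMMU redirects the
# access to the host page backing L2's memory, with no hypervisor
# involvement on the data path.
```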

===Macro optimizations===

The testing performed in the research shows that the two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2, and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified (no binary translation), the required changes were confined to L0.

They optimized the transitions between L1 and L2. Each such transition involves an exit to L0 and then an entry. In L0, most of the time is spent merging VMCSs (Virtual Machine Control Structures), so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying with tracking. VMCS handling is optimized further by copying multiple fields at once: normally, by Intel's specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field each. VMCS data can be accessed without ill side-effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies, although this may not work on processors other than the ones used in testing. The main cause of the slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, so L1 exits multiple times while handling a single L2 exit. With AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2's specification.
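The cost structure behind this optimization can be sketched as follows. The model is illustrative only (real VMCS fields number in the dozens, and exit costs vary): each vmread/vmwrite issued by L1 is a privileged instruction that traps to L0, while a bulk memory copy traps zero times.

```python
# Sketch of the VMCS copy trade-off: per-field vmread/vmwrite (each
# trapping to L0 when issued by L1) versus one large memory copy.
# Field names and exit counts are illustrative, not architectural.

FIELDS = ["guest_rip", "guest_rsp", "guest_cr3", "exit_reason"]

def merge_per_field(src, dst):
    """Spec-compliant path: one vmread plus one vmwrite per field,
    each a privileged instruction that traps when L1 executes it."""
    exits = 0
    for f in FIELDS:
        dst[f] = src[f]   # vmread + vmwrite
        exits += 2        # two trapping instructions per field
    return exits

def merge_bulk(src, dst):
    """Optimized path: copy the whole region with ordinary loads and
    stores (non-architectural; may break on untested processors)."""
    dst.update({f: src[f] for f in FIELDS})
    return 0              # no trapping instructions

vmcs12 = {f: 0 for f in FIELDS}
print(merge_per_field(vmcs12, {}), merge_bulk(vmcs12, {}))
```

The point is not the copy itself but the exit count: with many fields per merge and many merges per L2 exit, eliminating the per-field traps is where the savings come from.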

=Critique=

===Pros===

The paper unequivocally demonstrates a strong contribution in the area of virtualization and data sharing within a single machine. It is aimed at programmers, and should not have too large an effect on an end user running an application in a nested virtual machine. This is especially true if the user is using the system at a low depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing OSs that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It then follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is also visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor.

Since this is the first successful implementation of this type that does not modify hardware (earlier attempts were research designs of mixed quality), we expect to see increased interest in the nested integration model described above. The framework makes for convenient testing and debugging, since hypervisors can function inconspicuously alongside other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.

===Cons===

The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction. The everlasting memory/efficiency dispute continues as nested virtualization keeps garnering attention from the computing world. The performance hit is mainly imposed by the exponentially growing number of exits, caused when a nested guest traps and hands control to the lowest-level hypervisor, which may hand the trap off to the hypervisors above it before finally returning to the guest. Time complexity can thus become an issue when multiple levels of virtualization are involved, and the paper fails to demonstrate that the design is capable of handling a much deeper level of nesting.

Furthermore, we observed that the paper performs its tests at the L2 level, a guest with two hypervisors below it. It might have been useful, for understanding the limits of nesting, to investigate higher levels such as L4 or L5: it is difficult to predict how the system will react to deep nesting, because the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: "(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested". This means that some of the techniques the authors use to increase performance are not reproducible on other systems, and so the generality of parts of their solution may be limited.

===Style and Presentation===

The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and does an excellent job conveying the technical details. It does, however, assume a high level of background knowledge and familiarity with the subject, especially with the more technical points of the hardware architecture used to implement virtualization. For example, paragraph 4.1.2, "Impact of Multi-dimensional paging", attempts to illustrate the technique with an example using terms such as EPT and L1, which may not be familiar to readers not used to the technical language. The paper does touch on a wide range of topics in the field of virtualization, including CPU, memory and I/O device virtualization. This wide scope means that many of the major components of virtualization are discussed, so in the process of understanding the paper one learns a lot about many different parts of the field.

=Conclusion=

Some could argue that the research presented in the Turtles Project is the first to achieve a significant level of efficiency in x86 nested virtualization without altering the hardware, relying instead on improved techniques and mechanisms. This is a major improvement over the currently available solutions. The paper has garnered a lot of attention and praise from the computing world, including the Jay Lepreau best paper award. More recently, a paper on nested virtualization that examines trap forwarding has been published by the VMware research lab [18]; it references and uses some of the data published and documented in the Turtles Project.

=References=

[1] Tanenbaum, Andrew (2007). Modern Operating Systems (3rd edition), page 569.

[2] Popek & Goldberg (1974). Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts

[3] Tanenbaum, Andrew (2007). Modern Operating Systems (3rd edition), page 574-576.

[4] Goldberg, R. P. Architecture of Virtual Machines. In Proceedings of the Workshop on Virtual Computer Systems, ACM, pp. 74-112.

[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master's Thesis, University of Antwerp, 2010.

[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.

[7] Buytaert, Dittner & Rule. The best damn server virtualization book period. pages 16-18

[8] Clark, Fraser, Hand, Hansen, Jul, Limpach, Pratt & Warfield. Live migration of virtual machines. page 273-286

[9] Gillam, Lee (2010). Cloud Computing: Principles, Systems and Applications. page 26-27

[10] Sugerman, Venkitachalm & Lim. (2001). Virtualizing I/O devices on VMware workstation’s hosted virtual machine monitor.

[11] Russell, Rusty (2008). virtio: towards a de-facto standard for virtual I/O devices

[12] Barham, Dragovic, Fraser, Hand, Harris, Ho, Neugebauer, Pratt & Warfield (2003). Xen and the art of virtualization

[13] Levasseur, Uhlig, Stoess & Gotz (2004). Unmodified device driver reuse and improved system dependability via virtual machines.

[14] Yassour, Ben-Yehuda & Wasserman (2008). Direct device assignment for untrusted fully-virtualized virtual machines.

[15] Proceedings of the 5th European Conference on Information Warfare and Security : National Defence College, Helsinki, Finland, 1 - 2 June 2006. p 245-246

[16] Amazon Web Services

[17] Tanenbaum, Andrew (2007). Modern Operating Systems (3rd edition), page 576-577

[18] Poon & Mok. Bounding the running time of interrupt and exception forwarding in recursive virtualization for the x86 architecture.