<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mbingham</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mbingham"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Mbingham"/>
	<updated>2026-04-05T01:07:37Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19910</id>
		<title>WebFund 2015W: Assignment 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19910"/>
		<updated>2015-02-27T00:51:25Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In this assignment you will create &amp;lt;tt&amp;gt;queryNotes.js&amp;lt;/tt&amp;gt;, a program that will query the notes collection for matching notes.  Please submit your answers as a single JavaScript source file called &amp;quot;&amp;lt;username&amp;gt;-queryNotes.js&amp;quot; (where username is your MyCarletonOne username).  Please do not submit a zip file (i.e., no need for a node_modules directory).&lt;br /&gt;
&lt;br /&gt;
In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This assignment is due March 2, 2015.&lt;br /&gt;
&lt;br /&gt;
===Syntax===&lt;br /&gt;
&lt;br /&gt;
queryNotes.js should be called as follows:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js [--output=&amp;lt;output file&amp;gt;] [--maxcount=&amp;lt;max number of documents to return&amp;gt;]  &amp;lt;criteria&amp;gt; [&amp;lt;projection&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
Here is an example of a query that includes all the options:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js --output=out.txt --maxcount=10 &#039;{&amp;quot;owner&amp;quot;: &amp;quot;Alice&amp;quot;}&#039; &#039;{&amp;quot;content&amp;quot;: 1}&#039;&lt;br /&gt;
&lt;br /&gt;
The above would return at most 10 matching documents and output them to out.txt. The documents would consist of all those owned by Alice, with only the content field being returned (i.e., the projection). The format of the output should be the same as the format returned from calling toArray on the [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find()] query (see Key logic below): an array of JSON objects.&lt;br /&gt;
&lt;br /&gt;
===Arguments===&lt;br /&gt;
* The query is mandatory.  It should be a single string that follows the syntax of a [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find() criteria object] except that the enclosing curly braces are optional.  &lt;br /&gt;
* The output file should default to standard out.&lt;br /&gt;
* If no maxcount is specified then all matching documents should be returned.&lt;br /&gt;
* The projection should default to all fields.  The enclosing braces on the object should also be optional.&lt;br /&gt;
&lt;br /&gt;
===Key logic===&lt;br /&gt;
To process the query and projection strings you should:&lt;br /&gt;
* add enclosing curly braces if they are not present,&lt;br /&gt;
* convert the string to an object using [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse JSON.parse()], and then &lt;br /&gt;
* create a cursor object using [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find()], giving it the supplied criteria and optionally the projection object, and&lt;br /&gt;
* retrieve all of the objects using toArray() on the cursor object (see examples of toArray() [https://www.npmjs.com/package/mongodb here]).&lt;br /&gt;
* report an error message to standard error saying &amp;quot;Error...&amp;quot; (start with the word Error and say whatever afterwards) if there are any problems with the operation.&lt;br /&gt;
&lt;br /&gt;
===Scoring===&lt;br /&gt;
Points will be awarded as follows:&lt;br /&gt;
* 3 for doing the default find() properly using a query with braces&lt;br /&gt;
* 1 for handling missing curly braces&lt;br /&gt;
* 1 for --output working properly&lt;br /&gt;
* 1 for limiting the records correctly&lt;br /&gt;
* 1 for the projection (returning only certain fields)&lt;br /&gt;
* 1 for handling missing or malformed queries&lt;br /&gt;
* 2 for style&lt;br /&gt;
Total: 10 points&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19909</id>
		<title>WebFund 2015W: Assignment 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19909"/>
		<updated>2015-02-27T00:34:38Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In this assignment you will create &amp;lt;tt&amp;gt;queryNotes.js&amp;lt;/tt&amp;gt;, a program that will query the notes collection for matching notes.  Please submit your answers as a single JavaScript source file called &amp;quot;&amp;lt;username&amp;gt;-queryNotes.js&amp;quot; (where username is your MyCarletonOne username).  Please do not submit a zip file (i.e., no need for a node_modules directory).&lt;br /&gt;
&lt;br /&gt;
In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This assignment is due March 2, 2015.&lt;br /&gt;
&lt;br /&gt;
===Syntax===&lt;br /&gt;
&lt;br /&gt;
queryNotes.js should be called as follows:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js [--output=&amp;lt;output file&amp;gt;] [--maxcount=&amp;lt;max number of documents to return&amp;gt;]  &amp;lt;criteria&amp;gt; [&amp;lt;projection&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
Here is an example of a query that includes all the options:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js --output=out.txt --maxcount=10 &#039;{&amp;quot;owner&amp;quot;: &amp;quot;Alice&amp;quot;}&#039; &#039;{&amp;quot;content&amp;quot;: 1}&#039;&lt;br /&gt;
&lt;br /&gt;
The above would return at most 10 matching documents and output them to out.txt. The documents would consist of all those owned by Alice, with only the content field being returned (i.e., the projection).&lt;br /&gt;
&lt;br /&gt;
===Arguments===&lt;br /&gt;
&lt;br /&gt;
* The query is mandatory.  It should be a single string that follows the syntax of a [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find() criteria object] except that the enclosing curly braces are optional.  &lt;br /&gt;
* The output file should default to standard out.&lt;br /&gt;
* If no maxcount is specified then all matching documents should be returned.&lt;br /&gt;
* The projection should default to all fields.  The enclosing braces on the object should also be optional.&lt;br /&gt;
&lt;br /&gt;
===Key logic===&lt;br /&gt;
To process the query and projection strings you should:&lt;br /&gt;
* add enclosing curly braces if they are not present,&lt;br /&gt;
* convert the string to an object using [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse JSON.parse()], and then &lt;br /&gt;
* create a cursor object using [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find()], giving it the supplied criteria and optionally the projection object, and&lt;br /&gt;
* retrieve all of the objects using toArray() on the cursor object (see examples of toArray() [https://www.npmjs.com/package/mongodb here]).&lt;br /&gt;
* report an error message to standard error saying &amp;quot;Error...&amp;quot; (start with the word Error and say whatever afterwards) if there are any problems with the operation.&lt;br /&gt;
&lt;br /&gt;
===Scoring===&lt;br /&gt;
Points will be awarded as follows:&lt;br /&gt;
* 3 for doing the default find() properly using a query with braces&lt;br /&gt;
* 1 for handling missing curly braces&lt;br /&gt;
* 1 for --output working properly&lt;br /&gt;
* 1 for limiting the records correctly&lt;br /&gt;
* 1 for the projection (returning only certain fields)&lt;br /&gt;
* 1 for handling missing or malformed queries&lt;br /&gt;
* 2 for style&lt;br /&gt;
Total: 10 points&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19906</id>
		<title>WebFund 2015W: Assignment 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_5&amp;diff=19906"/>
		<updated>2015-02-25T22:35:24Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This assignment is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In this assignment you will create &amp;lt;tt&amp;gt;queryNotes.js&amp;lt;/tt&amp;gt;, a program that will query the notes collection for matching notes.  Please submit your answers as a single JavaScript source file called &amp;quot;&amp;lt;username&amp;gt;-queryNotes.js&amp;quot; (where username is your MyCarletonOne username).  Please do not submit a zip file (i.e., no need for a node_modules directory).&lt;br /&gt;
&lt;br /&gt;
In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This assignment is due March 2, 2015.&lt;br /&gt;
&lt;br /&gt;
===Syntax===&lt;br /&gt;
&lt;br /&gt;
queryNotes.js should be called as follows:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js [--output=&amp;lt;output file&amp;gt;] [--maxcount=&amp;lt;max number of documents to return&amp;gt;]  &amp;lt;criteria&amp;gt; [&amp;lt;projection&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
Here is an example of a query that includes all the options:&lt;br /&gt;
&lt;br /&gt;
  node queryNotes.js --output=out.txt --maxcount=10 &amp;quot;{owner: \&amp;quot;Alice\&amp;quot;}&amp;quot; &amp;quot;{content: 1}&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The above would return at most 10 matching documents and output them to out.txt. The documents would consist of all those owned by Alice, with only the content field being returned (i.e., the projection).&lt;br /&gt;
&lt;br /&gt;
===Arguments===&lt;br /&gt;
&lt;br /&gt;
* The query is mandatory.  It should be a single string that follows the syntax of a [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find() criteria object] except that the enclosing curly braces are optional.  &lt;br /&gt;
* The output file should default to standard out.&lt;br /&gt;
* If no maxcount is specified then all matching documents should be returned.&lt;br /&gt;
* The projection should default to all fields.  The enclosing braces on the object should also be optional.&lt;br /&gt;
&lt;br /&gt;
===Key logic===&lt;br /&gt;
To process the query and projection strings you should:&lt;br /&gt;
* add enclosing curly braces if they are not present,&lt;br /&gt;
* convert the string to an object using [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse JSON.parse()], and then &lt;br /&gt;
* create a cursor object using [http://docs.mongodb.org/manual/reference/method/db.collection.find/ find()], giving it the supplied criteria and optionally the projection object, and&lt;br /&gt;
* retrieve all of the objects using [http://docs.mongodb.org/manual/reference/method/cursor.toArray/ .toArray()] on the cursor object.&lt;br /&gt;
* report an error message to standard error saying &amp;quot;Error...&amp;quot; (start with the word Error and say whatever afterwards) if there are any problems with the operation.&lt;br /&gt;
&lt;br /&gt;
===Scoring===&lt;br /&gt;
Points will be awarded as follows:&lt;br /&gt;
* 3 for doing the default find() properly using a query with braces&lt;br /&gt;
* 1 for handling missing curly braces&lt;br /&gt;
* 1 for --output working properly&lt;br /&gt;
* 1 for limiting the records correctly&lt;br /&gt;
* 1 for the projection (returning only certain fields)&lt;br /&gt;
* 1 for handling missing or malformed queries&lt;br /&gt;
* 2 for style&lt;br /&gt;
Total: 10 points&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19820</id>
		<title>WebFund 2015W: Assignment 2</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19820"/>
		<updated>2015-02-09T15:11:55Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In this assignment you will be modifying the [http://homeostasis.scs.carleton.ca/~soma/webfund-2015w/code/form-demo.zip form-demo] code from Tutorial 2.&lt;br /&gt;
&lt;br /&gt;
Please submit your answers as a single zip file called &amp;quot;&amp;lt;username&amp;gt;-comp2406-assign2.zip&amp;quot; (where username is your MyCarletonOne username).  This zip file should unpack into a directory of the same name (minus the .zip extension of course).  This directory should contain your modified version of &amp;lt;tt&amp;gt;form-demo&amp;lt;/tt&amp;gt; (one version with all code changes).&lt;br /&gt;
&lt;br /&gt;
In addition to the points below, you also get up to 2 points for overall programming style.  In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Questions==&lt;br /&gt;
&lt;br /&gt;
# [1] Make the application listen on port 3200 by default.&lt;br /&gt;
# [1] Add a &amp;quot;Home&amp;quot; link to the submit results screen (/add) that takes you back to the form screen.  Note that this should be a link (a tag)!&lt;br /&gt;
# [2] Add a &amp;quot;Phone&amp;quot; field to the end of the form that behaves similarly to all of the other form entries, in all parts of the application.&lt;br /&gt;
#* the name attribute of the phone field should be &amp;quot;phone&amp;quot;&lt;br /&gt;
# [4] Implement a simple query interface that searches previous records for any record containing partial matches to the query in any of the filled-in fields. For example, if the user inputs the string &amp;quot;test&amp;quot; and hits submit, the list screen should show all records that have &amp;quot;test&amp;quot; somewhere in one of their fields.  Specifically, this interface should: &lt;br /&gt;
#* be a new screen accessible from the main page (/) from a button whose text says &amp;quot;Query&amp;quot;&lt;br /&gt;
#* be located at a route named /query&lt;br /&gt;
#* contain a form with a single text input whose name attribute is &amp;quot;input&amp;quot; and a single submit button whose action is &amp;quot;/doquery&amp;quot; using a POST request.&lt;br /&gt;
#* once submitted, return a listing screen (based on the already existing /list screen) that lists only the records which contain, in one or more of their fields, the input string as a substring.&lt;br /&gt;
&lt;br /&gt;
==Solutions==&lt;br /&gt;
See [http://scs.carleton.ca/~mbingham/assn2-solutions.zip solution code].&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19689</id>
		<title>WebFund 2015W: Assignment 2</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19689"/>
		<updated>2015-01-22T19:11:17Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This assignment is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In addition to the points below, you also get up to 2 points for overall programming style.  In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;  Further details on how to structure your output will be added here soon.&lt;br /&gt;
&lt;br /&gt;
The following questions require you to modify the [http://homeostasis.scs.carleton.ca/~soma/webfund-2015w/code/form-demo.zip form-demo] code from Tutorial 2.&lt;br /&gt;
&lt;br /&gt;
# [1] Make the application listen on port 3200 by default.&lt;br /&gt;
# [1] Add a &amp;quot;Home&amp;quot; link to the submit results screen (/add) that takes you back to the form screen.  Note that this should be a link (a tag)!&lt;br /&gt;
# [2] Add a &amp;quot;Phone&amp;quot; field to the end of the form that behaves similarly to all of the other form entries, in all parts of the application.&lt;br /&gt;
#* the name attribute of the phone field should be &amp;quot;phone&amp;quot;&lt;br /&gt;
# [4] Implement a simple query interface that searches previous records for any record containing partial matches to the query in any of the filled in fields. This interface should: &lt;br /&gt;
#* be a new screen accessible from the main page (/) from a button whose text says &amp;quot;Query&amp;quot;&lt;br /&gt;
#* be located at a route named /query&lt;br /&gt;
#* contain a form with a single text input whose name attribute is &amp;quot;input&amp;quot; and a single submit button whose action is &amp;quot;/doquery&amp;quot; using a POST request.&lt;br /&gt;
#* once submitted, return a listing screen (based on the already existing /list screen) that lists only the records which contain, in one or more of their fields, the input string as a substring.&lt;br /&gt;
&lt;br /&gt;
For example, if the user inputs the string &amp;quot;test&amp;quot; and hits submit, the list screen should show all records that have &amp;quot;test&amp;quot; somewhere in one of their fields.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19687</id>
		<title>WebFund 2015W: Assignment 2</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19687"/>
		<updated>2015-01-20T14:18:09Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This assignment is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In addition to the points below, you also get up to 2 points for overall programming style.  In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;  Further details on how to structure your output will be added here soon.&lt;br /&gt;
&lt;br /&gt;
The following questions require you to modify the [http://homeostasis.scs.carleton.ca/~soma/webfund-2015w/code/form-demo.zip form-demo] code from Tutorial 2.&lt;br /&gt;
&lt;br /&gt;
# [1] Make the application listen on port 3200 by default.&lt;br /&gt;
# [1] Add a &amp;quot;Home&amp;quot; link to the submit results screen (/add) that takes you back to the form screen.  Note that this should be a link (a tag)!&lt;br /&gt;
# [2] Add a &amp;quot;Phone&amp;quot; field to the end of the form that behaves similarly to all of the other form entries, in all parts of the application.&lt;br /&gt;
# [4] Implement a simple query interface that searches previous records for any record containing partial matches to the query in any of the filled in fields. This interface should: &lt;br /&gt;
#* be a new screen accessible from the main page (/) from a button whose text says &amp;quot;Query&amp;quot;&lt;br /&gt;
#* be located at a route named /query&lt;br /&gt;
#* contain a form with a single text input whose name attribute is &amp;quot;input&amp;quot; and a single submit button whose action is &amp;quot;/doquery&amp;quot; using a POST request.&lt;br /&gt;
#* once submitted, return a listing screen (based on the already existing /list screen) that lists only the records which contain, in one or more of their fields, the input string as a substring.&lt;br /&gt;
&lt;br /&gt;
For example, if the user inputs the string &amp;quot;test&amp;quot; and hits submit, the list screen should show all records that have &amp;quot;test&amp;quot; somewhere in one of their fields.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19682</id>
		<title>WebFund 2015W: Assignment 2</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2015W:_Assignment_2&amp;diff=19682"/>
		<updated>2015-01-19T19:50:15Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This assignment is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In addition to the points below, you also get up to 2 points for overall programming style.  In total, there are 10 points.  &#039;&#039;&#039;The questions below will be automatically graded for everything but style.&#039;&#039;&#039;  Further details on how to structure your output will be added here soon.&lt;br /&gt;
&lt;br /&gt;
The following questions require you to modify the [http://homeostasis.scs.carleton.ca/~soma/webfund-2015w/code/form-demo.zip form-demo] code from Tutorial 2.&lt;br /&gt;
&lt;br /&gt;
# [1] Make the application listen on port 3200 by default.&lt;br /&gt;
# [1] Add a &amp;quot;Home&amp;quot; link to the submit results screen (/add) that takes you back to the form screen.  Note that this should be a link (a tag)!&lt;br /&gt;
# [2] Add a &amp;quot;Phone&amp;quot; field to the end of the form that behaves similarly to all of the other form entries, in all parts of the application.&lt;br /&gt;
# [4] Implement a simple query interface that searches the previous records for any containing partial matches to any of the filled in fields.  The button at the bottom should say &amp;quot;Query&amp;quot; and should cause a new query screen to be loaded.  This new screen, once submitted, should return a listing screen (based on the original listing one) that lists only the selected records.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Operating_Systems_2014F:_Assignment_6&amp;diff=19568</id>
		<title>Operating Systems 2014F: Assignment 6</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Operating_Systems_2014F:_Assignment_6&amp;diff=19568"/>
		<updated>2014-12-12T15:21:44Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Please submit the answers to the following questions via CULearn by midnight on Wednesday, November 12, 2014.  There are 20 points in 14 questions.&lt;br /&gt;
&lt;br /&gt;
Submit your answers as a single text file named &amp;quot;&amp;lt;username&amp;gt;-comp3000-assign6.txt&amp;quot; (where username is your MyCarletonOne username).  The first four lines of this file should be &amp;quot;COMP 3000 Assignment 6&amp;quot;, your name, student number, and the date of submission.  You may wish to format your answers in [http://en.wikipedia.org/wiki/Markdown Markdown] to improve their appearance.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;No other formats will be accepted.&#039;&#039;&#039;  Submitting in another format will likely result in your assignment not being graded and you receiving no marks for this assignment.  In particular do not submit a zip file, MS Word, or OpenOffice file as your answers document!&lt;br /&gt;
&lt;br /&gt;
Don&#039;t forget to include what outside resources you used to complete each of your answers, including other students, man pages, and web resources.  You do not need to list help from the instructor, TA, or information found in the textbook.&lt;br /&gt;
&lt;br /&gt;
==Part A==&lt;br /&gt;
&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;dd if=/dev/zero of=foo bs=8192 count=32K&amp;lt;/tt&amp;gt;.  What is the logical size of the file?  How much space does it consume on disk? (Hint: Look at the size option to ls.)&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;mkfs.ext4 foo&amp;lt;/tt&amp;gt;.  (Say &amp;quot;yes&amp;quot; to operating on a regular file.)  Does foo consume any more space?&lt;br /&gt;
# [1] What command do you run to check the filesystem in foo for errors?&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;mount foo /mnt&amp;lt;/tt&amp;gt;.  How does this command change what files are accessible?&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;df&amp;lt;/tt&amp;gt;.  What device is mounted on /mnt?  What is this device?&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;rsync -a -v /etc /mnt&amp;lt;/tt&amp;gt;.  What does this command do?  Explain the arguments as well.&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;umount /mnt&amp;lt;/tt&amp;gt;.  Which files can you still access, and which have gone away?&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;dd if=/dev/zero of=foo conv=notrunc count=10 bs=512&amp;lt;/tt&amp;gt;.  How does the &amp;quot;conv=notrunc&amp;quot; change dd&#039;s behavior (versus the command in question 1)?&lt;br /&gt;
# [1] Run &amp;lt;tt&amp;gt;sudo mount foo /mnt&amp;lt;/tt&amp;gt;.  What error do you get?&lt;br /&gt;
# [1] What command can you run to make foo mountable again?  What characteristic of the file system enables this command to work?&lt;br /&gt;
# [1] Run the command &amp;lt;tt&amp;gt;truncate -s 1G bar&amp;lt;/tt&amp;gt;.  What is the logical size of bar, and how much space does it consume on disk?  How does this compare with foo?&lt;br /&gt;
# [1] How does the logical size of bar change when you create an ext4 filesystem in it?  What about the space consumed on disk?&lt;br /&gt;
&lt;br /&gt;
==Part B==&lt;br /&gt;
&amp;lt;ol start=&amp;quot;13&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[4] Write your own version of the command line program stat, which simply calls the stat() system call on a given file or directory. Print out file size, number of blocks allocated, reference (link) count, and so forth. What is the link count of a directory, as the number of entries in the directory changes? Useful interfaces: stat()&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[4] Write a program that prints out the last few lines of a file. The program should be efficient, in that it seeks to near the end of the file, reads in a block of data, and then goes backwards until it finds the requested number of lines; at this point, it should print out those lines from beginning to the end of the file. To invoke the program, one should type: mytail -n file, where n is the number of lines at the end of the file to print. Useful interfaces: stat(), lseek(), open(), read(), close().&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Solutions==&lt;br /&gt;
Solutions for this assignment were discussed in [[Operating Systems 2014F Lecture 22|Lecture 22]].&lt;br /&gt;
&lt;br /&gt;
#  Logical size is 268435456 bytes (as reported by ls -l). Physical size is 262144 blocks (as reported by ls -ls).  By default ls assumes blocks are 1K in size; thus the physical space taken up by the file is 262144 * 1024 = 268439552 bytes.  In other words, the physical size is 4096 bytes larger than the logical size.  Note ext4 by default uses 4K blocks.&lt;br /&gt;
# No - it consumes less space.  The logical size stays the same, but now it only consumes 8798208 bytes.  In other words the physical size is around 3.3% of the logical size!  (Note mkfs.ext4 somehow inserted &amp;quot;holes&amp;quot; into the file.)&lt;br /&gt;
# fsck.ext4. Just using the fsck command alone on a regular file like foo (in this environment) will only display the usage options.&lt;br /&gt;
# Anything previously in /mnt is now not available, and the files of the filesystem in foo are now available within /mnt.&lt;br /&gt;
# /dev/loop0. This is a loop block device.  It allows a file such as foo (associated with the device) to be accessed as if it were a block device.  (NOTE: In class we referred to this as the &amp;quot;loopback block&amp;quot; device, but to reduce confusion it is normally referred to as the loop device.)&lt;br /&gt;
# The command &amp;quot;rsync -a -v /etc /mnt&amp;quot; synchronizes the contents of /etc to /mnt.  Specifically it only copies the files from the source (/etc) to the destination (/mnt) if they need to be; files that are already the same in both locations are not touched.   The -a option means archive (use recursion and grab almost everything and preserve as much file metadata as possible).  The -v option says to be verbose; without it rsync runs quietly (unless it encounters an error).  Note that with these options files that exist in the destination but not in the source are preserved.  If you want extraneous files and directories to be deleted in the destination, add &amp;quot;--delete --force&amp;quot;.&lt;br /&gt;
# Any files that were in /mnt before the mount command are accessible again. Any files from the foo filesystem (i.e., those visible in /mnt while it was mounted) are no longer accessible.&lt;br /&gt;
# conv=notrunc means that dd will not truncate the file after it is done writing; it leaves the rest of the file intact.&lt;br /&gt;
# The error is: &amp;quot;mount: you must specify the filesystem type&amp;quot;&lt;br /&gt;
# fsck.ext4 foo  (on some versions of fsck you may have to manually specify the location of a backup superblock)&lt;br /&gt;
# Logical size is 1G. Physical size is 0.  In contrast, foo&#039;s physical size (when initially created) was just a bit larger than its logical size.&lt;br /&gt;
# The logical size stays the same. The physical size increases (to 33,968,128 bytes).&lt;br /&gt;
# Two parts:&lt;br /&gt;
#* The command &amp;quot;man 2 stat&amp;quot; describes the stat API.  For a full version of a C program parsing what stat returns (and a bit more) see: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/stat.c&lt;br /&gt;
#* A directory has a link count that starts at 2: one is the normal one coming from the entry in the parent directory, while the second comes from the &amp;quot;.&amp;quot; entry (the directory links to itself).  Every subdirectory added increases the link count by one, because of the .. entry in each subdirectory.&lt;br /&gt;
# You need to create a program that reads a chunk of the file from the end of the file, counts the lines in it, and then repeats until it finds the position of the start of the number of lines to be output.  It should then read and output the lines from that position in the file (potentially using previously read data or reading it again).  See the file_lines() function in the GNU coreutils tail implementation: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tail.c#n481  Note that the rest of this implementation of tail handles things like reading from a pipe and reading the last lines of a file forever (the -f option).&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Operating_Systems_2014F_Lecture_5&amp;diff=19275</id>
		<title>Operating Systems 2014F Lecture 5</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Operating_Systems_2014F_Lecture_5&amp;diff=19275"/>
		<updated>2014-09-24T15:48:25Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;Introduction&amp;#039;&amp;#039;&amp;#039;  &amp;lt;u&amp;gt;previous class:&amp;lt;/u&amp;gt;  * Previous lecture we learned that its very difficult to write a scheduler * tradeoff between response time and turnaround time * F...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Introduction&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;previous class:&amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Previous lecture we learned that it’s very difficult to write a scheduler&lt;br /&gt;
* tradeoff between response time and turnaround time&lt;br /&gt;
* FIFO, SJF, RR, MLFQ - an attempt to balance turnaround time and response time by approximating SJF&lt;br /&gt;
&lt;br /&gt;
Why is it so hard to write a scheduler? The amount of information the scheduler knows about a process is very limited on a commodity operating system. Some systems are different - ‘real-time OSes’ give the scheduler much more information about each task.&lt;br /&gt;
What we want the scheduler to do is very complicated, and is often conflicting - we want it to be responsive/interactive, but we also want our jobs to get done. &lt;br /&gt;
We have to deal with a lot of corner cases - what if a long running background process suddenly becomes highly interactive? What if an application developer tries to game the scheduler? What if we have processes that should have arbitrarily high priority for no discernible reason? These corner cases lead to complicated and inelegant solutions like priority boosting and accounting.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;in this class:&amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will examine an alternative approach to scheduling which optimizes for proportionality, rather than turnaround or response time.&lt;br /&gt;
We will also discuss the concerns with scheduling processes in a multi-processor environment.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proportional-Share Scheduling&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Let’s forget turnaround and response time. What if we wanted to optimize for proportionality (i.e., ensure that each process receives a proportional share of the CPU, where the exact proportion is defined by us)?&lt;br /&gt;
Lottery Scheduling is one way to accomplish this.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Lottery Scheduling&amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Basic unit is the ticket. Each process has a certain number of tickets. The number of tickets represents the share of the resources the process should have.&lt;br /&gt;
Let’s have two processes - A and B. We want A to have 75% of the resources, and B to have 25%.&lt;br /&gt;
How can we implement this?&lt;br /&gt;
The scheduler generates a random number n from 0 to 99. Tickets 0 to 74 belong to process A, tickets 75 to 99 belong to process B.&lt;br /&gt;
&lt;br /&gt;
This is easily implemented as a singly linked list. Imagine three processes, A, B and C, with 100 tickets each (draw three circles on board). A counter goes through the list, incrementing by the number of tickets at each node. When it goes above the winning ticket number, we run that process.&lt;br /&gt;
Design consideration - what if A and B have 1 ticket each, and C has 100? Then we should sort the list so that the highest number of tickets comes first, etc. This reduces the number of checks we have to do.&lt;br /&gt;
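The draw described above can be sketched in JavaScript (the ticket counts are the A/B example from earlier; the names here are this sketch's own, not from any real scheduler):

```javascript
// Lottery scheduling sketch: pick a winning ticket at random, then walk
// the process list, accumulating ticket counts until the running total
// passes the winner.
const processes = [
  { name: "A", tickets: 75 },
  { name: "B", tickets: 25 },
];

function lotteryPick(procs) {
  const total = procs.reduce((sum, p) => sum + p.tickets, 0);
  const winner = Math.floor(Math.random() * total); // 0 .. total - 1
  let counter = 0;
  for (const p of procs) {
    counter += p.tickets;
    if (counter > winner) return p; // this process holds the winning ticket
  }
}

// Over many draws, A should win roughly 75% of the time.
let aWins = 0;
const draws = 100000;
for (let i = 0; i < draws; i++) {
  if (lotteryPick(processes).name === "A") aWins++;
}
console.log("A won", ((100 * aWins) / draws).toFixed(1), "% of draws");
```

Sorting the list by descending ticket count, as suggested above, only changes how quickly the loop exits, not which process wins.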
&lt;br /&gt;
Lottery Scheduling - Ticket Mechanisms&lt;br /&gt;
&lt;br /&gt;
This abstraction of a ticket is more useful than it may seem.&lt;br /&gt;
* ticket currency - process/user can allocate tickets to subjobs, perhaps in a different ratio (or currency)&lt;br /&gt;
* ticket transfer - I can give my tickets to another process. Imagine a client/server setting where the client hands off a job to the server and transfers its tickets so the server can finish the job faster.&lt;br /&gt;
* ticket inflation - process can temporarily raise or lower the number of tickets it owns. Note this is only possible in a non-competitive scenario.&lt;br /&gt;
Okay, great. So we have all these tickets, but how do we assign them?&lt;br /&gt;
Potential solutions: give each user an equal share of tickets, have programs assign themselves a certain number of tickets, etc. Whatever we do, there really isn’t a good way to assign tickets. It can work in some specialized cases, but probably won’t be the best for a general purpose OS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Deterministic Lottery Scheduling&amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Possible problems with using a non-deterministic scheduler - though on average the system will probabilistically reach the run-time proportions we want, over the short term it may randomly bias towards one process or user.&lt;br /&gt;
However, notice that we can transform probabilistic lottery scheduling into deterministic lottery scheduling pretty easily.&lt;br /&gt;
Stride scheduling - each process has a stride, which is inversely proportional to the number of tickets it has (tickets A -&amp;gt; 100, B -&amp;gt; 50, C -&amp;gt; 250; divide each into 10,000 to get strides A -&amp;gt; 100, B -&amp;gt; 200, C -&amp;gt; 40).&lt;br /&gt;
Every time a process runs, increment a counter (the pass value) by its stride. Choose the process which has the lowest pass value (if two are equal, then pick between them in some arbitrary way).&lt;br /&gt;
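The stride mechanism can be sketched as follows (same ticket counts as the example above; the function and variable names are this sketch's own):

```javascript
// Stride scheduling sketch: stride = 10,000 / tickets, and the process
// with the lowest pass value runs next.
const procs = [
  { name: "A", tickets: 100, stride: 100, pass: 0 },
  { name: "B", tickets: 50,  stride: 200, pass: 0 },
  { name: "C", tickets: 250, stride: 40,  pass: 0 },
];

function strideNext(list) {
  // Pick the process with the lowest pass value (ties broken by list order).
  let next = list[0];
  for (const p of list) if (p.pass < next.pass) next = p;
  next.pass += next.stride; // charge it for the time slice it receives
  return next;
}

// Total tickets = 400, so over 400 decisions each process runs exactly
// in proportion to its tickets: A 100 times, B 50, C 250.
const counts = { A: 0, B: 0, C: 0 };
for (let i = 0; i < 400; i++) counts[strideNext(procs).name]++;
console.log(counts); // { A: 100, B: 50, C: 250 }
```

Unlike the lottery, this is deterministic: the proportions are exact over every full cycle, not just in expectation.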
&lt;br /&gt;
&amp;lt;u&amp;gt;Why pick randomness or determinism?&amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Why use randomness?&lt;br /&gt;
# Random scheduling does not need to remember state (compare to the amount of state that is kept in an MLFQ).&lt;br /&gt;
# Randomness can avoid strange corner cases with weird behaviour (because it doesn’t remember state).&lt;br /&gt;
# It is simple to implement.&lt;br /&gt;
# Calculating which process should run next is very quick - O(1).&lt;br /&gt;
Why not use randomness?&lt;br /&gt;
Can’t guarantee that processes will have equal share over the short term - only that over the long term it will tend towards the proportions we want.&lt;br /&gt;
Typically, this solution isn’t used because of how difficult it is to figure out how to assign tickets. But one example where it works well is something like a VM environment where we want 3/4 of the CPU to go to the host and 1/4 to go to the VM.&lt;br /&gt;
&lt;br /&gt;
----------------------------------------------------------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Multiprocessor Scheduling&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
All the algorithms we have previously looked at assume that there is one processor sharing the resources. In most modern computers there are more likely two, four, or even more.&lt;br /&gt;
Important concept in a multiprocessor environment - caches and memory hierarchy. Processors keep a local cache of frequently accessed data.&lt;br /&gt;
* temporal locality - data that has been accessed in the past is likely to be accessed again in the future. Think of a loop or an in-scope variable.&lt;br /&gt;
* spatial locality - data that is near other data is likely to be accessed - think of an array.&lt;br /&gt;
These caches impact scheduling in two important ways.&lt;br /&gt;
There’s an entire field of research on how to keep these caches synchronized across different processors called cache coherence. It is difficult and requires a lot of locking.&lt;br /&gt;
Also there is cache affinity. Processes build up state on a processor (in the caches, in the TLB (i.e., a cache for address translations), etc.). Moving them between processors frequently can decrease performance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;First Solution - Single Queue Multiprocessor Scheduling (SQMS) &amp;lt;/u&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Have a single queue of jobs. Then use whatever other algorithm we want to pick the best one (or two) jobs to run next when a processor is free.&lt;br /&gt;
One advantage: very conceptually simple&lt;br /&gt;
* Problem 1: it is not scalable. The single queue is a shared data structure and so must be locked every time it is accessed. As the number of processors increases it will be locked more and more frequently, until it is basically locked all the time.&lt;br /&gt;
* Problem 2: cache affinity. Imagine we have five jobs (A, B, C, D, and E) and four processors. (Draw the situation on the board and explain how it is bad.)&lt;br /&gt;
We can address problem 2 by adding affinity - have four of the jobs stay on their processors and have only one job migrate.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Second Solution - Multi Queue Multiprocessor Scheduling (MQMS) &amp;lt;/u&amp;gt;&lt;br /&gt;
Each processor has its own queue. When a new job comes into the system it is put on exactly one queue. Then each queue is scheduled independently.&lt;br /&gt;
Advantage - scalable with the number of processors (b/c each one has its own queue). Also inherently provides cache affinity since jobs stay on the same processor.&lt;br /&gt;
&lt;br /&gt;
However, this introduces a new problem - load imbalance. Let’s say we have two queues with two jobs each. One queue has one job finish - now the other job gets all that processing time. If that job finishes too, then a CPU is sitting empty. The general solution to this is migration. If one queue is empty then it’s easy - just move one job to that CPU. However, it’s trickier in other cases. If CPU A has one job and CPU B has two jobs, what do we do? Best solution is probably continuous migration - i.e., keep moving one job between the two queues.&lt;br /&gt;
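A minimal sketch of the MQMS ideas just described - per-CPU queues, placing new jobs on the shortest queue, and migrating a job when one queue empties (the function names are this sketch's own, not from any real scheduler):

```javascript
// MQMS sketch: one queue per CPU. New jobs go to the shortest queue,
// and a simple rebalance pass migrates one job when some queue is empty
// while another still has more than one job waiting.
function placeJob(queues, job) {
  let shortest = queues[0];
  for (const q of queues) if (q.length < shortest.length) shortest = q;
  shortest.push(job);
}

function rebalance(queues) {
  const empty = queues.find(q => q.length === 0);
  const busy = queues.find(q => q.length > 1);
  if (empty && busy) empty.push(busy.pop()); // migrate one job
}

const queues = [[], []]; // two CPUs
["A", "B", "C", "D"].forEach(job => placeJob(queues, job));
console.log(queues.map(q => q.length)); // balanced: [ 2, 2 ]

queues[0].length = 0; // pretend CPU 0 finished both of its jobs
rebalance(queues);
console.log(queues.map(q => q.length)); // one job migrated: [ 1, 1 ]
```

Each queue can then be scheduled independently with any of the single-CPU algorithms, which is what makes this approach scale.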
&lt;br /&gt;
In the world of Linux schedulers, both approaches (SQMS and MQMS) have been used for multiprocessor environments.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Conclusion&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Proportional share scheduling&amp;lt;/u&amp;gt;&lt;br /&gt;
* Advantages generally: “fair” solution - gives each process the proportion of the system that it is assigned to have.&lt;br /&gt;
* Disadvantage generally: leaves open the problem of ticket assignment. Doesn’t mesh well with I/O.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Lottery Scheduling&amp;lt;/u&amp;gt;&lt;br /&gt;
* Advantages: simple, quick, no state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Stride Scheduling&amp;lt;/u&amp;gt; &lt;br /&gt;
* Advantages: deterministic solution in all cases&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;SQMS&amp;lt;/u&amp;gt; &lt;br /&gt;
* Advantages: conceptually simple. &lt;br /&gt;
* Disadvantages: not scalable and no cache affinity&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;MQMS&amp;lt;/u&amp;gt; &lt;br /&gt;
* Advantages: scales well and has inherent cache affinity &lt;br /&gt;
* Disadvantages: load imbalance problem&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The Takeaway&#039;&#039;&#039;&lt;br /&gt;
Takeaway from all our lectures on scheduling - scheduling is a very difficult problem. It is full of all sorts of different tradeoffs.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2014W:_Tutorial_4&amp;diff=18542</id>
		<title>WebFund 2014W: Tutorial 4</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2014W:_Tutorial_4&amp;diff=18542"/>
		<updated>2014-01-31T13:59:23Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In this tutorial we will be playing with a program that is similar to last tutorial&#039;s [[WebFund 2014W: Tutorial 3|sessions demo]] except that we now:&lt;br /&gt;
* authenticate the user with a password,&lt;br /&gt;
* secure communication using https (using a self-signed SSL certificate), and&lt;br /&gt;
* have persistence across server restarts.&lt;br /&gt;
&lt;br /&gt;
Because we are using SSL, you will need to connect to https://localhost:3000 rather than the standard http address.  You will also get a warning about the self-signed certificate; this is normal.  However, you may want to try examining the certificate to see what information it contains.&lt;br /&gt;
&lt;br /&gt;
The sample express application is [http://homeostasis.scs.carleton.ca/~soma/webfund-2014w/T4/auth-ssl-demo.zip auth-ssl-demo].&lt;br /&gt;
&lt;br /&gt;
If you were doing authentication in a real application, you should probably use a more mature solution like [http://everyauth.com/ everyauth] or [http://passportjs.org/ Passport]; however, this solution does follow standard practice of storing the password in a form that is (somewhat) hard to reverse (hashed and salted) and we are only transmitting it over an encrypted channel.&lt;br /&gt;
&lt;br /&gt;
You should get the application running, look at the code, and then attempt to answer the questions below about the code and make the suggested modifications.&lt;br /&gt;
&lt;br /&gt;
===Running the code outside of the class VM===&lt;br /&gt;
&lt;br /&gt;
Note that if you are running this code outside of the class VM, you will probably need to delete the node_modules directory and run &amp;lt;tt&amp;gt;npm install&amp;lt;/tt&amp;gt; as some of the modules for this class use native code.  In particular, this code uses OpenSSL&#039;s implementation of bcrypt.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;npm install&amp;lt;/tt&amp;gt; should work fine on Linux, as Linux distributions often include OpenSSL as part of the standard development environment.&lt;br /&gt;
Building this on Windows machines can be tricky, however, as OpenSSL is not installed normally; you&#039;ll have to install it separately in addition to Visual Studio.  See [https://npmjs.org/package/bcrypt the node bcrypt package documentation] for more information on how to get it to run on Windows.&lt;br /&gt;
&lt;br /&gt;
A reasonable question here is, why not use a JavaScript implementation of the crypto primitives?  They do exist; however, you should always use CERTIFIED IMPLEMENTATIONS of cryptography in your applications.  If it hasn&#039;t been properly tested and evaluated, you are running very very serious risks.  Friends don&#039;t let friends implement cryptography for anything except personal entertainment!&lt;br /&gt;
&lt;br /&gt;
Having said that, you should be able to get the code working using pure JavaScript with [https://npmjs.org/package/bcryptjs bcryptjs] or [https://npmjs.org/package/bcrypt-nodejs bcrypt-nodejs] packages with minor changes to the application.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Questions==&lt;br /&gt;
&lt;br /&gt;
You will get full credit for this tutorial for attending and showing a TA that you can at least answer a few of the questions below.  You are highly encouraged, though, to try to answer all of the following during tutorial.&lt;br /&gt;
&lt;br /&gt;
# What is the difference between the Login and Register button on the initial screen?  Do they work the same way?&lt;br /&gt;
# Generate your own SSL certificate for the application.  How do you know you succeeded?&lt;br /&gt;
# MongoDB&#039;s &amp;quot;tables&amp;quot; are collections; they are grouped together into databases.  What MongoDB database is used by this application?  What collections?&lt;br /&gt;
# How long before this app&#039;s session cookies expire?&lt;br /&gt;
# Do sessions and user accounts persist across web application restarts?  &lt;br /&gt;
# Once the application is running successfully, kill the MongoDB server and see how the application behaves when you attempt to register a new user.  Does it &amp;quot;succeed&amp;quot; or does it report an error?  Is the user properly registered?  (You can stop and start the server in the VM using the command &amp;lt;tt&amp;gt;sudo service mongodb stop&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;sudo service mongodb start&amp;lt;/tt&amp;gt;, respectively.)&lt;br /&gt;
# In the POST function for /login, it processes a username and password supplied by the user.  How are they accessed?  Where did this information come from?  And, are they validated in any way?&lt;br /&gt;
# Why are there three arguments to the app.get() calls, rather than the previous two?&lt;br /&gt;
# How can you change this app to list all of the currently logged in users on /users?&lt;br /&gt;
# The &amp;lt;tt&amp;gt;routes.register()&amp;lt;/tt&amp;gt; has multiple nested functions.  What do each of them do, and why are they nested the way they are?&lt;br /&gt;
# What is &amp;lt;tt&amp;gt;toArray()&amp;lt;/tt&amp;gt; doing in the calls to &amp;lt;tt&amp;gt;find()&amp;lt;/tt&amp;gt;?  Specifically, what does the syntax mean, and why is the toArray() call necessary?&lt;br /&gt;
# Type &amp;lt;tt&amp;gt;mongo&amp;lt;/tt&amp;gt; to connect to the running mongodb instance on your machine.  What do the following commands do? What does this show you about how passwords are stored in this application?&lt;br /&gt;
       help&lt;br /&gt;
       show dbs&lt;br /&gt;
       show collections&lt;br /&gt;
       use auth-hash-demo&lt;br /&gt;
       db.sessions.find()&lt;br /&gt;
       db.users.find()&lt;br /&gt;
       db.system.indexes.find()&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2014W:_Tutorial_1&amp;diff=18376</id>
		<title>WebFund 2014W: Tutorial 1</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2014W:_Tutorial_1&amp;diff=18376"/>
		<updated>2014-01-13T19:39:15Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;For this class you will be using a [http://www.lubuntu.net/ Lubuntu] virtual machine appliance.  We will be using [https://www.virtualbox.org/ VirtualBox] as our preferred virtualization platform; however, VMware Workstation/Fusion and other virtualization platforms should be able to run the appliance as well.  In this first tutorial you will become familiar with the [http://nodejs.org/ node.js]-based development environment provided by this appliance.&lt;br /&gt;
&lt;br /&gt;
To get credit for this lab, show a TA or the instructor that you have gotten the class VM running, made simple changes to your first web app, and that you have started lessons on Codecademy (or convince them you don&#039;t need to).&lt;br /&gt;
&lt;br /&gt;
If you finish early (which you are likely to do), try exploring node and the Lubuntu environment.  You will be using them a lot this semester!&lt;br /&gt;
&lt;br /&gt;
==Running the VM==&lt;br /&gt;
&lt;br /&gt;
In the SCS labs you should be able to run the VM by starting Virtualbox (listed in the Applications menu) and selecting the COMP 2406 virtual machine.  After the VM has fully booted up you can login to the student account using the password &amp;quot;tneduts!&amp;quot;.  This account has administrative privileges; in addition, there is the admin account in case your student account gets corrupted for any reason.  The password for it is &amp;quot;nimda!&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
We highly recommend running your VM in full-screen mode.  (Don&#039;t maximize the window; instead select full screen from the view menu.)  Do all of your work inside of the VM; it should be fast enough and you won&#039;t have any issues with sharing files or with host firewalls.&lt;br /&gt;
&lt;br /&gt;
If you want to run the appliance on your own system (running essentially any desktop operating system you want), just download the &lt;br /&gt;
[http://homeostasis.scs.carleton.ca/~soma/webfund-2014w/COMP%202406.ova COMP 2406.ova] [http://people.scs.carleton.ca/~soma/webfund-2014w/COMP%202406.ova (alternate source)] virtual appliance file and import.  The SHA1 hash of this file is:&lt;br /&gt;
&lt;br /&gt;
  3c0ad1c89d58b5b9b1225a3a7c876a500e0621a8  COMP 2406.ova&lt;br /&gt;
&lt;br /&gt;
On Windows you can compute this hash for your downloaded file using the command [http://support.microsoft.com/kb/889768 &amp;lt;tt&amp;gt;FCIV -sha1 COMP 2406.ova&amp;lt;/tt&amp;gt;].  If the hash is different from above, your download has been corrupted.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If your virtualization application is not VirtualBox, you&#039;ll need to:&lt;br /&gt;
* Have the VM platform ignore any errors in the structure of the appliance when importing;&lt;br /&gt;
* Uninstall the VirtualBox guest additions by starting a terminal application and running&lt;br /&gt;
   sudo /opt/VBoxGuestAdditions-4.2.16/uninstall.sh&lt;br /&gt;
* Install your platform&#039;s own Linux guest additions, if available.&lt;br /&gt;
&lt;br /&gt;
Note that, as we will explain, you will have the ability to easily save the work you do from any VM to your SCS account and restore it to any other copy of the class VM.  Thus feel free to play around with VMs; if you break anything, you can always revert.  Remember though that in the labs you &#039;&#039;&#039;must&#039;&#039;&#039; save and restore your work, as all of your changes to the VM will be lost when you logout!&lt;br /&gt;
&lt;br /&gt;
While you may update the software in the VM, those updates will be lost when you next login to the lab machines; thus, you probably only want to update a VM installed on your own system.&lt;br /&gt;
&lt;br /&gt;
==Hello, World!==&lt;br /&gt;
&lt;br /&gt;
To create your first node application, start [http://www.geany.org/ geany], [http://brackets.io/ brackets], or [http://www.gnu.org/software/emacs/ emacs] code editors by clicking on their quick launch icons at the bottom left of the screen (beside the LXDE start menu button).&lt;br /&gt;
&lt;br /&gt;
(While vi is also installed in the VM, you may wish to run emacs and type Alt-X [http://www.gnu.org/software/emacs/manual/html_mono/viper.html viper-mode].  You&#039;re welcome.)&lt;br /&gt;
&lt;br /&gt;
In your editor of choice, create a file &amp;lt;tt&amp;gt;hello.js&amp;lt;/tt&amp;gt; in your Documents folder with the following contents:&lt;br /&gt;
&lt;br /&gt;
   console.log(&amp;quot;Hello, World!&amp;quot;);&lt;br /&gt;
&lt;br /&gt;
You can now run this file by opening an LXTerminal (under Accessories) and typing:&lt;br /&gt;
&lt;br /&gt;
   cd Documents&lt;br /&gt;
   node hello.js&lt;br /&gt;
&lt;br /&gt;
And you should see &amp;lt;tt&amp;gt;Hello, World!&amp;lt;/tt&amp;gt; output to your terminal.&lt;br /&gt;
&lt;br /&gt;
You can also run node interactively by simply running &amp;lt;tt&amp;gt;node&amp;lt;/tt&amp;gt; with no arguments.  You&#039;ll then get a prompt where you can enter any code that you like and see what it does.  To exit this environment, type Control-D.&lt;br /&gt;
&lt;br /&gt;
Note that when run interactively, we say that node is running a [http://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop read-eval-print loop] (REPL).  It reads input, evaluates it, and then prints the results.  This structure is very old in computer science, going back to the first LISP interpreters from the early 1960&#039;s.&lt;br /&gt;
&lt;br /&gt;
==Your First Web App==&lt;br /&gt;
&lt;br /&gt;
Web applications, even simple ones, are a bit more complex than our &amp;quot;Hello, world!&amp;quot; example.  Fortunately in node we have the [http://expressjs.com/ express] web application framework to make getting up and running quite easy.&lt;br /&gt;
&lt;br /&gt;
In a terminal window, run the following commands:&lt;br /&gt;
&lt;br /&gt;
  express first&lt;br /&gt;
  cd first&lt;br /&gt;
  npm install&lt;br /&gt;
  node app.js&lt;br /&gt;
&lt;br /&gt;
You should get a message at the end saying that your app is listening on port 3000.  To see what your app is doing, start up a web browser in your VM and visit the following URL:&lt;br /&gt;
&lt;br /&gt;
  http://localhost:3000&lt;br /&gt;
&lt;br /&gt;
You should see a message from your first web application!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039;  It seems that doing the above can generate errors due to recent changes in the Jade template engine.  Also, npm may run very slowly due to networking issues.  So alternately, do the following:&lt;br /&gt;
&lt;br /&gt;
  # simulate &amp;quot;express first&amp;quot; (create application skeleton)&lt;br /&gt;
  wget https://homeostasis.scs.carleton.ca/~soma/webfund-2014w/T1/first.zip&lt;br /&gt;
  unzip first.zip&lt;br /&gt;
  rm first.zip&lt;br /&gt;
  &lt;br /&gt;
  cd first&lt;br /&gt;
  &lt;br /&gt;
  wget https://homeostasis.scs.carleton.ca/~soma/webfund-2014w/T1/node_modules.zip&lt;br /&gt;
  unzip node_modules.zip&lt;br /&gt;
  rm node_modules.zip&lt;br /&gt;
  &lt;br /&gt;
  # make sure dependencies are fulfilled; should return quickly&lt;br /&gt;
  npm install&lt;br /&gt;
  &lt;br /&gt;
  node app.js&lt;br /&gt;
&lt;br /&gt;
==Simple Changes==&lt;br /&gt;
&lt;br /&gt;
Now that you have an app up and running, make the following simple changes:&lt;br /&gt;
* Change the default port to 2000 (&amp;lt;tt&amp;gt;app.js&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* Change the title to &amp;quot;My First Web App&amp;quot; (&amp;lt;tt&amp;gt;routes/index.js&amp;lt;/tt&amp;gt;)&lt;br /&gt;
* Prevent the default stylesheet &amp;lt;tt&amp;gt;style.css&amp;lt;/tt&amp;gt; from being loaded (&amp;lt;tt&amp;gt;views/layout.jade&amp;lt;/tt&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
==Saving your work==&lt;br /&gt;
&lt;br /&gt;
You can save your work to your SCS account by running&lt;br /&gt;
&lt;br /&gt;
  save2406 &amp;lt;SCS username&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will rsync /home/student to the COMP2406 directory in your SCS account by connecting to access.scs.carleton.ca.&lt;br /&gt;
&lt;br /&gt;
When you wish to restore your student account, run&lt;br /&gt;
&lt;br /&gt;
  restore2406 &amp;lt;SCS username&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that both of these commands are destructive - they will wipe out all the files in the COMP2406 folder on SCS or /home/student in your VM.  If you want to see what the differences are between the two versions, run&lt;br /&gt;
&lt;br /&gt;
  compare2406 &amp;lt;SCS username&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Codecademy==&lt;br /&gt;
&lt;br /&gt;
Now that you&#039;ve got your virtual machine running, it is time to start learning about web technologies.  In tutorial (if you have time) or on your own, you should either go through or make sure you know the material in all of the following Codecademy modules:&lt;br /&gt;
* [http://www.codecademy.com/tracks/web Web Fundamentals]&lt;br /&gt;
* [http://www.codecademy.com/tracks/javascript/ Javascript]&lt;br /&gt;
* [http://www.codecademy.com/tracks/jquery jQuery]&lt;br /&gt;
* [http://www.codecademy.com/tracks/projects Web Projects]&lt;br /&gt;
&lt;br /&gt;
Feel free to skip around; these should be very simple for you, at least at the beginning.  Try to do the last parts of each lesson to see if you need to bother going through it.  You&#039;ll be expected to know most of this material in order to successfully complete the first assignment.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_8&amp;diff=18123</id>
		<title>WebFund 2013F: Tutorial 8</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_8&amp;diff=18123"/>
		<updated>2013-10-25T14:16:03Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This tutorial is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
In this tutorial you will be modifying [https://homeostasis.scs.carleton.ca/~soma/webfund-2013f/letterpaint-demo.zip letterpaint-demo].  Note the client-side code is from [https://hacks.mozilla.org/2013/06/building-a-simple-paint-game-with-html5-canvas-and-vanilla-javascript/ this HTML5 canvas tutorial].&lt;br /&gt;
&lt;br /&gt;
First, download the app, unzip it, and get it running as usual.  Note this app is just the default express application with the letterpaint game added to the public/ directory (and minor modifications to the index page).&lt;br /&gt;
&lt;br /&gt;
Do the following:&lt;br /&gt;
&lt;br /&gt;
* First, play with the game normally and using the developer tools to inspect the page.&lt;br /&gt;
* Then, add a quit button:&lt;br /&gt;
** Add the button to the page in &amp;lt;tt&amp;gt;index.html&amp;lt;/tt&amp;gt; (make it the letter Q) similar to the information and sound buttons.&lt;br /&gt;
** Style the quit button in a fashion similar to the info button in &amp;lt;tt&amp;gt;letterpaint.css&amp;lt;/tt&amp;gt; (look for the #infos attributes).  Note you will have to change the offset; otherwise the button will overlap with the info button.  (If you do not customize the CSS for the button at all, it will take on the default navbutton properties and overlap with the sound button.)&lt;br /&gt;
** In &amp;lt;tt&amp;gt;letterpaint.js&amp;lt;/tt&amp;gt;:&lt;br /&gt;
*** Add a quitbutton variable similar to the one for sound or info.&lt;br /&gt;
*** Add an event listener for the quit button, referring to a &amp;lt;tt&amp;gt;quit()&amp;lt;/tt&amp;gt; callback function.&lt;br /&gt;
*** Define &amp;lt;tt&amp;gt;quit()&amp;lt;/tt&amp;gt;, having it set the value of &amp;lt;tt&amp;gt;window.location&amp;lt;/tt&amp;gt; to be the URL you wish to visit when you quit (&amp;quot;/&amp;quot;).&lt;br /&gt;
* Finally, do one or more of the following tasks:&lt;br /&gt;
** Make the info screen be dismissed when you click anywhere on it (rather than just by clicking on the info button again).&lt;br /&gt;
** Add a counter to the bottom right that tracks the score - how many letters have been traced correctly.&lt;br /&gt;
** &#039;&#039;&#039;Bonus (HARD):&#039;&#039;&#039; Send scores to the server on quit.  Save the highest score and have that displayed somewhere in the letterpaint game (say, in the middle of the top bar).  The high score should be updated from the server at the start of every game.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_4&amp;diff=18034</id>
		<title>WebFund 2013F: Tutorial 4</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_4&amp;diff=18034"/>
		<updated>2013-09-27T15:06:20Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Node debugging */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This lab is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In this lab you will be learning the basics of debugging Node-based web applications.  All of the following assumes you have &amp;lt;tt&amp;gt;form-demo&amp;lt;/tt&amp;gt; set up and running from [[WebFund 2013F: Assignment 1|Assignment 1]].&lt;br /&gt;
&lt;br /&gt;
==Browser-based debugging==&lt;br /&gt;
* Firefox: Tools-&amp;gt;Web Developer-&amp;gt;Toggle Tools&lt;br /&gt;
* Chrome/Chromium: Tools-&amp;gt;Developer Tools&lt;br /&gt;
&lt;br /&gt;
Select the Network tab to see HTTP traffic.&lt;br /&gt;
Select Inspector (Firefox) or Elements (Chrome/Chromium) to see the HTML document.&lt;br /&gt;
&lt;br /&gt;
==Node debugging==&lt;br /&gt;
&lt;br /&gt;
Node has a [http://nodejs.org/api/debugger.html built-in debugger].  Start it by running &amp;lt;tt&amp;gt;node debug app.js&amp;lt;/tt&amp;gt;.  This will stop on the first line of the file.  Type &amp;lt;tt&amp;gt;n&amp;lt;/tt&amp;gt; to step to the next line of the file. Type &amp;lt;tt&amp;gt;c&amp;lt;/tt&amp;gt; to continue to the next breakpoint.  Breakpoints are set by adding a &amp;lt;tt&amp;gt;debugger;&amp;lt;/tt&amp;gt; statement to the JavaScript source.  &lt;br /&gt;
&lt;br /&gt;
At any time you can type &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; into the debugger to drop into a read-eval-print loop where you can evaluate JavaScript statements in the current context.  Ctrl-C will get you out of the REPL.&lt;br /&gt;
&lt;br /&gt;
For example, given the source&lt;br /&gt;
&amp;lt;source lang=&amp;quot;javascript&amp;quot;&amp;gt;&lt;br /&gt;
var x = 5;&lt;br /&gt;
var y = 10;&lt;br /&gt;
&lt;br /&gt;
debugger;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
You can run &amp;lt;tt&amp;gt;node debug app.js&amp;lt;/tt&amp;gt;. This will start the debugger, which will stop on the first line of the file (&amp;lt;tt&amp;gt;var x = 5;&amp;lt;/tt&amp;gt;). If you enter &amp;lt;tt&amp;gt;c&amp;lt;/tt&amp;gt;, node will continue executing until the &amp;lt;tt&amp;gt;debugger;&amp;lt;/tt&amp;gt; statement, where it will stop. From here, if you enter &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; you can execute JavaScript in the current context. In the &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; prompt, if you enter &amp;lt;tt&amp;gt;x;&amp;lt;/tt&amp;gt; it will return 5. If you enter &amp;lt;tt&amp;gt;x + y;&amp;lt;/tt&amp;gt; it will return 15, etc.&lt;br /&gt;
&lt;br /&gt;
More commands:&lt;br /&gt;
*&amp;lt;tt&amp;gt;s&amp;lt;/tt&amp;gt; step in&lt;br /&gt;
*&amp;lt;tt&amp;gt;o&amp;lt;/tt&amp;gt; step out&lt;br /&gt;
*&amp;lt;tt&amp;gt;list(x)&amp;lt;/tt&amp;gt; show x lines around the current line&lt;br /&gt;
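For example, &amp;lt;tt&amp;gt;n&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;s&amp;lt;/tt&amp;gt; behave differently at a function call. A minimal sketch (a hypothetical app.js, not part of any assignment):&lt;br /&gt;

```javascript
// Hypothetical app.js for trying the step commands.
function add(a, b) {
  return a + b;       // 's' at the call below steps into this line
}

debugger;             // 'c' continues to here
var sum = add(2, 3);  // 'n' runs the whole call; 's' steps into add()
console.log(sum);     // prints 5
```

Run it with &amp;lt;tt&amp;gt;node debug app.js&amp;lt;/tt&amp;gt; and compare what &amp;lt;tt&amp;gt;n&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;s&amp;lt;/tt&amp;gt; do at the &amp;lt;tt&amp;gt;add(2, 3)&amp;lt;/tt&amp;gt; line.&lt;br /&gt;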
&lt;br /&gt;
==Brackets and Theseus==&lt;br /&gt;
&lt;br /&gt;
==Tasks==&lt;br /&gt;
* Observe the request and response for the app&#039;s home page (http://localhost:3010).  Look at both the network panel (load the page &#039;&#039;after&#039;&#039; selecting the network panel) and the HTML DOM view (Inspector/Elements)&lt;br /&gt;
* Observe the contents of the form submit POST request: how much data is sent to the server?  Observe it both from the browser side (to see what is sent) and inside of node, particularly where the POST results are returned.&lt;br /&gt;
* Look at other web pages!&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_4&amp;diff=18033</id>
		<title>WebFund 2013F: Tutorial 4</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=WebFund_2013F:_Tutorial_4&amp;diff=18033"/>
		<updated>2013-09-27T15:04:25Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Node debugging */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;This lab is not yet finalized&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In this lab you will be learning the basics of debugging Node-based web applications.  All of the following assumes you have &amp;lt;tt&amp;gt;form-demo&amp;lt;/tt&amp;gt; set up and running from [[WebFund 2013F: Assignment 1|Assignment 1]].&lt;br /&gt;
&lt;br /&gt;
==Browser-based debugging==&lt;br /&gt;
* Firefox: Tools-&amp;gt;Web Developer-&amp;gt;Toggle Tools&lt;br /&gt;
* Chrome/Chromium: Tools-&amp;gt;Developer Tools&lt;br /&gt;
&lt;br /&gt;
Select the Network tab to see HTTP traffic.&lt;br /&gt;
Select the Inspector (Firefox) or Elements (Chrome/Chromium) tab to see the HTML document.&lt;br /&gt;
&lt;br /&gt;
==Node debugging==&lt;br /&gt;
&lt;br /&gt;
Node has a built-in debugger [http://nodejs.org/api/debugger.html].  Start it by running &amp;lt;tt&amp;gt;node debug app.js&amp;lt;/tt&amp;gt;.  This will stop on the first line of the file.  Type &amp;lt;tt&amp;gt;n&amp;lt;/tt&amp;gt; to step to the next line of the file. Type &amp;lt;tt&amp;gt;c&amp;lt;/tt&amp;gt; to continue to the next breakpoint.  Breakpoints are set by adding a &amp;lt;tt&amp;gt;debugger;&amp;lt;/tt&amp;gt; statement to the JavaScript source.&lt;br /&gt;
&lt;br /&gt;
At any time you can type &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; into the debugger to drop into a read-eval-print loop where you can evaluate JavaScript statements in the current context.  Ctrl-C will get you out of the REPL.&lt;br /&gt;
&lt;br /&gt;
For example, given the source&lt;br /&gt;
&amp;lt;source line lang=&amp;quot;javascript&amp;quot;&amp;gt;&lt;br /&gt;
var x = 5;&lt;br /&gt;
var y = 10;&lt;br /&gt;
&lt;br /&gt;
debugger;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
You can run &amp;lt;tt&amp;gt;node debug app.js&amp;lt;/tt&amp;gt;. This will start the debugger, which will stop on the first line of the file (&amp;lt;tt&amp;gt;var x = 5;&amp;lt;/tt&amp;gt;). If you enter &amp;lt;tt&amp;gt;c&amp;lt;/tt&amp;gt;, node will continue executing until the &amp;lt;tt&amp;gt;debugger;&amp;lt;/tt&amp;gt; statement, where it will stop. From here, if you enter &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; you can execute JavaScript in the current context. At the &amp;lt;tt&amp;gt;repl&amp;lt;/tt&amp;gt; prompt, if you enter &amp;lt;tt&amp;gt;x;&amp;lt;/tt&amp;gt; it will return 5. If you enter &amp;lt;tt&amp;gt;x + y;&amp;lt;/tt&amp;gt; it will return 15, and so on.&lt;br /&gt;
&lt;br /&gt;
More commands:&lt;br /&gt;
*&amp;lt;tt&amp;gt;s&amp;lt;/tt&amp;gt; step in&lt;br /&gt;
*&amp;lt;tt&amp;gt;o&amp;lt;/tt&amp;gt; step out&lt;br /&gt;
*&amp;lt;tt&amp;gt;list(x)&amp;lt;/tt&amp;gt; show x lines around the current line&lt;br /&gt;
&lt;br /&gt;
==Brackets and Theseus==&lt;br /&gt;
&lt;br /&gt;
==Tasks==&lt;br /&gt;
* Observe the request and response for the app&#039;s home page (http://localhost:3010).  Look at both the network panel (load the page &#039;&#039;after&#039;&#039; selecting the network panel) and the HTML DOM view (Inspector/Elements)&lt;br /&gt;
* Observe the contents of the form submit POST request: how much data is sent to the server?  Observe it both from the browser side (to see what is sent) and inside of node, particularly where the POST results are returned.&lt;br /&gt;
* Look at other web pages!&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6689</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6689"/>
		<updated>2010-12-03T03:17:28Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Conclusion */ expanded on conclusion&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to the discussion page for group member confirmation, general talk and paper discussion.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++&lt;br /&gt;
* Zvi Dubitzky +&lt;br /&gt;
* Michael Factor +&lt;br /&gt;
* Nadav Har’El +&lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some background on the concepts&lt;br /&gt;
and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest virtual machine the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where the technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes of our assigned paper, we focus on full virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guests with one another, and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, ranging from 0 to 3.&lt;br /&gt;
Ring 0 (root mode) is the most privileged level, allowing access to the bare hardware components. The operating system kernel must&lt;br /&gt;
execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3 (user mode). Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, that provides privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Another thing to note is that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute higher-level instructions or access privileged hardware components, it triggers a trap or fault which is caught by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every other hypervisor running on top of it. For instance, assume that L0 (host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, then the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln will act as a trap handler for its guest VM at level Ln+1. This model is not supported by the x86 based systems that are discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86 based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, when L1 attempts to run its own virtual machine L2, this will trigger a trap that goes down to L0. Then L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln will be handled by the host hypervisor at level L0 and then the resulting emulated instruction goes back to Ln.&lt;br /&gt;
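As a toy illustration of why the single-level model is expensive, the following sketch (illustrative only; the function and transfer labels are invented, not from the paper) lists the control transfers for a trap in a guest at nesting level n:&lt;br /&gt;

```javascript
// Toy model of the single-level architecture: every exit lands in L0 first,
// and L0 forwards it to the hypervisor responsible for the trapping guest.
// Returns the sequence of control transfers for a trap in guest Ln.
function singleLevelTrapPath(n) {
  const path = [`L${n} traps, exit to L0`];
  if (n > 1) {
    // L0 hands the exit to the guest's parent hypervisor L(n-1);
    // the parent's resume request comes back down through L0 as well.
    path.push(`L0 forwards exit to L${n - 1}`, `L${n - 1} asks L0 to resume`);
  }
  path.push(`L0 resumes L${n}`);
  return path;
}

console.log(singleLevelTrapPath(1).length); // 2
console.log(singleLevelTrapPath(2).length); // 4: the extra hops multiply exits
```

Each forwarded exit is itself a world switch, which is why one trap in a deeply nested guest multiplies into many transitions.&lt;br /&gt;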
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run a particular application or OS that&#039;s not compatible with the running OS as a virtual machine. Operating systems can also provide the user a compatibility mode for other operating systems or applications; an example of this is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a hollow program or network that appears functional to outside users but in reality exists only as a security tool to watch or trap attackers. Using nested virtualization, we can create a honeypot of our system as virtual machines and observe how our virtual system is attacked or what kinds of features are exploited. We can take advantage of the fact that these virtual honeypots can easily be controlled, manipulated, destroyed or even restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster&lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so a single trap instruction multiplies into many.&lt;br /&gt;
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, as on x86 processors. Solutions that do not require such support suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique to reconcile efficiency with the lack of hardware support on available hardware. It is for the most part able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors acknowledge the CPU, memory, and IO devices as the three key resources that they need to share. Combining this, the paper presents a solution to the problem of how to multiplex the CPU, memory, and IO efficiently between multiple virtual operating systems and hypervisors on a system which has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The constant evolution of computers invites intricate designs that are virtualized and work in harmony with cloud computing. The paper contributes here by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, give programmers a foundation for further development and ideas built on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs around a particular VM state, which could certainly be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (VMCS: virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed to the CPU for execution. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and the architecture supports only a single level of hypervisor. To multiplex the hardware, L0 makes L2 run as a virtual machine of L1: L0 merges the VMCSs, combining VMCS0-&amp;gt;1 with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2 (enabling L0 to run L2 directly). L0 then launches L2; when L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is L1&#039;s responsibility to handle. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts; these operations would not normally be a problem, but because L1 is running in guest mode as a virtual machine they all trap, so a single high-level L2 (or L3) exit causes many exits, and more exits means less performance. This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
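The VMCS merge can be pictured with a conceptual sketch (this is illustrative JavaScript, not the Turtles or KVM code; the field names are invented for the example):&lt;br /&gt;

```javascript
// Conceptual sketch of merging VMCS0->1 (L0's view of L1) with VMCS1->2
// (L1's description of L2) into VMCS0->2, the structure L0 actually
// loads into the hardware to run L2 directly.
function mergeVmcs(vmcs01, vmcs12) {
  return {
    // Guest state: run L2's code, as described by L1.
    guestState: { ...vmcs12.guestState },
    // Host state: on exit, control must return to L0, not to L1.
    hostState: { ...vmcs01.hostState },
    // Exit reasons: L0 must trap everything L1 asked for plus its own.
    exitControls: [...new Set([...vmcs01.exitControls, ...vmcs12.exitControls])],
  };
}

const vmcs01 = { guestState: { rip: 0x1000 }, hostState: { rip: 0xffff0000 }, exitControls: ['EXT_INT'] };
const vmcs12 = { guestState: { rip: 0x2000 }, hostState: { rip: 0x1000 }, exitControls: ['CPUID'] };
const vmcs02 = mergeVmcs(vmcs01, vmcs12);
console.log(vmcs02.guestState.rip === 0x2000);    // true: hardware runs L2
console.log(vmcs02.hostState.rip === 0xffff0000); // true: exits land in L0
```

The key design point the sketch captures is that the guest half of VMCS0-&amp;gt;2 comes from L1&#039;s specification of L2, while the host half must point back at L0.&lt;br /&gt;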
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU exposes only two page tables via EPT (extended page tables), which translate virtual to guest-physical and guest-physical to host-physical addresses. The authors compress the three translations onto the two tables, going from start to end in two hops instead of three. This is done with a shadow page table for the virtual machine and shadow-on-EPT; shadow-on-EPT compresses the three logical translations into two page tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
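The compression can be pictured as function composition (an illustrative sketch; page tables are modelled here as simple maps, not real EPT structures):&lt;br /&gt;

```javascript
// Conceptual sketch: build EPT0->2 by translating every target of
// EPT1->2 (L2-physical to L1-physical) through EPT0->1 (L1-physical
// to L0-physical), so L2 addresses reach L0 addresses in one hop.
function composeEpt(ept01, ept12) {
  const ept02 = new Map();
  for (const [l2pa, l1pa] of ept12) {
    if (ept01.has(l1pa)) ept02.set(l2pa, ept01.get(l1pa));
  }
  return ept02;
}

const ept01 = new Map([[0x100, 0xa00], [0x200, 0xb00]]); // L1-pa -> L0-pa
const ept12 = new Map([[0x10, 0x100], [0x20, 0x200]]);   // L2-pa -> L1-pa
const ept02 = composeEpt(ept01, ept12);                  // L2-pa -> L0-pa
console.log(ept02.get(0x10).toString(16)); // a00
```

Because the composed table changes only when one of its inputs changes, and EPT tables change rarely, the composition pays off in fewer exits.&lt;br /&gt;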
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the guest runs a driver that knows it is virtualized (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. For DMA, each hypervisor (L0 and L1) needs to use an IOMMU to let its virtual machines access the device safely. There is only one hardware IOMMU, so L0 must emulate an IOMMU for L1; L0 then compresses the multiple IOMMU tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device DMAs directly into L2&#039;s memory space.&lt;br /&gt;
&lt;br /&gt;
==Macro optimizations==&lt;br /&gt;
How did they implement the optimizations that make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. They optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimized this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling is optimized further by copying multiple VMCS fields at once: normally, by Intel&#039;s specification, reads and writes must be performed with the vmread and vmwrite instructions, which operate on a single field, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and data sharing within a single machine. It is aimed at programmers, and should not have too large an effect on an end user running an application in a nested virtual machine, especially if the user is using the system at a low depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing OSes that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It then follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is also visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor.  Since this is the first successful implementation of this type that does not modify hardware (there have been half-decent research designs), we expect to see increased interest in the nested integration model described above. The framework makes for convenient testing and debugging because hypervisors can function inconspicuously towards other nested hypervisors and VMs without being detected. Moreover, the efficiency overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction; the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits that occurs when a nested guest traps, handing control to the lowest-level hypervisor, which may hand off the trap to hypervisors above it before finally returning to the guest. Furthermore, we observed that the paper performs tests at the L2 level, a guest with two hypervisors below it. It might have been useful, for understanding the limits of nesting, to investigate higher levels of nesting such as L4 or L5, because it can be difficult to predict how the system will react to deep nesting: the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as the avoidance of vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;. This means that some of the techniques the authors use to increase performance are not reproducible on other systems, and so the generality of parts of their solution may be limited.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. It does, however, assume a high level of background knowledge and familiarity with the subject, especially with some of the more technical points of the architecture the hardware uses to implement virtualization. For example, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, attempts to illustrate the technique with an example using terms such as EPT and L1, which may not be familiar to readers not used to the technical language. The paper does touch on a wide range of topics in the field of virtualization, including CPU, memory and I/O device virtualization. This wide scope means that many of the major components of virtualization are discussed, so in the process of understanding the paper one learns a lot about many different parts of the field.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
The research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. This is a major improvement over the currently available solutions, and the techniques used to achieve nested virtualization are comprehensive and interesting. It also has good potential as a basis for future research. The authors refer to security and clouds as two potential areas for future work; another interesting direction could be how the approaches the authors apply, compressing multiple levels of abstraction into one level with multi-dimensional paging and multi-level device assignment, could be applied to other problems that involve nesting. The paper also won the Best Paper award at the conference, further reflecting its quality.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6677</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6677"/>
		<updated>2010-12-03T03:05:31Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The cons */ another sentence&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++&lt;br /&gt;
* Zvi Dubitzky +&lt;br /&gt;
* Michael Factor +&lt;br /&gt;
* Nadav Har’El +&lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of the paper, it&#039;s essential to provide some background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulated copy of the underlying hardware for a guest operating system, program or process to run on. [1] This emulation, usually referred to as a virtual machine, gives the guest the illusion that it&#039;s running on the bare hardware; in reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, and is now associated with a number of areas where the technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes of our assigned paper, we focus on full virtualization of hardware in the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of guest virtual machines. The main task of the hypervisor is to present an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to handle the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|left|400px]]&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, numbered 0 to 3. Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3, while Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest attempts to access privileged hardware components, the hypervisor intervenes and handles the request on the guest&#039;s behalf in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. Portability, however, is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles the hypervisors running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, every trap must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, a trap is triggered that goes down to L0, and L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction then goes back to Ln.&lt;br /&gt;
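The single-level flow above can be sketched in a few lines of Python (our own toy model, not code from the paper): every trap, whatever its nesting level, is delivered to L0, which emulates the instruction and hands the result back.&lt;br /&gt;

```python
# Toy model of single-level (x86-style) trap handling: a trap raised
# at any level Ln always lands in L0, which emulates the privileged
# instruction and returns the result to Ln.

def handle_trap(level, instruction):
    """Return the control-flow path taken by one trap at level `level`."""
    path = [f"L{level} traps"]
    path.append("hardware delivers trap to L0")   # never to Ln-1 directly
    result = f"emulated({instruction})"           # L0 emulates the op
    path.append(f"L0 returns {result} to L{level}")
    return path

print(handle_trap(2, "vmlaunch"))
```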
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run, as a virtual machine, a particular application or OS that&#039;s not compatible with the running OS. Operating systems can also offer the user a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites can host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a hollow program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot version of our system as virtual machines and watch how the virtual system is attacked and which features are exploited, taking advantage of the fact that virtual honeypots can easily be controlled, manipulated, destroyed or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one entity that&#039;s easier to manage.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking. Since a virtual machine is essentially a file on the host operating system, if corrupted or damaged it can easily be removed, recreated or restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. More recently there have been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. A trap can thus be bounced between different levels of hypervisor, so that one trap instruction multiplies into many.&lt;br /&gt;
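As a rough illustration of this multiplication (a toy model with made-up numbers, not figures from the paper), suppose handling one exit at level n forces the hypervisor at level n-1 to execute a handful of privileged operations, each of which itself traps and must be handled one level further down:&lt;br /&gt;

```python
# Toy model: handling one exit at level n requires the parent hypervisor
# to execute k privileged ops, each of which is itself an exit handled
# one level lower, so exits multiply roughly as k**depth.

def exits_caused(level, privileged_ops_per_exit=3):
    """Total hardware-level exits caused by one exit at `level`."""
    if level == 1:
        return 1            # L0 handles an L1 exit directly
    # the parent hypervisor performs several privileged ops, each of
    # which is itself an exit handled one level further down
    total = 0
    for _ in range(privileged_ops_per_exit):
        total += exits_caused(level - 1, privileged_ops_per_exit)
    return total

for depth in (1, 2, 3, 4):
    print(depth, exits_caused(depth))   # 1, 3, 9, 27: exponential growth
```

With three privileged operations per exit, a single trap at depth four already turns into 27 hardware-level exits, which is why making exits cheap and rare matters so much.&lt;br /&gt;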
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software in the guest machines are not practical, because this support does not always exist, for example on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps grows with nesting depth. This paper presents a technique that reconciles the lack of hardware support with efficiency: it largely contains the problem of a single nested trap expanding into many trap instructions, at least for the nesting depths the authors considered, allowing efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to be shared. In sum, the paper presents a solution to the problem of multiplexing the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The nonstop evolution of computers invites intricate designs that are virtualized and in harmony with cloud computing. The paper contributes to this trend by allowing consumers and users to inject machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs around a particular VM state, which could well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is passed to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. That vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and the hardware supports only a single level of virtualization. To multiplex, L0 makes L2 run as a virtual machine of L1 by merging VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is the L1 virtual machine&#039;s responsibility. To handle a single L2 exit, L1 must read and write the VMCS and disable interrupts, which wouldn&#039;t normally be a problem, but because L1 is running in guest mode each of those operations itself traps, so a single high-level L2 (or L3) exit causes many exits, and more exits means less performance. The problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
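The VMCS merge at the heart of this flow can be sketched as follows (a minimal sketch of our own; the field names are invented, and a real VMCS has dozens of fields):&lt;br /&gt;

```python
# Sketch: L0 merges VMCS0to1 (how L0 runs L1) with VMCS1to2 (how L1
# wants to run L2) into VMCS0to2, which lets L0 run L2 directly.

def merge_vmcs(vmcs_0to1, vmcs_1to2):
    """Guest state comes from what L1 asked for (L2); host state stays L0's."""
    return {
        "guest_state": vmcs_1to2["guest_state"],   # run L2's register state
        "host_state":  vmcs_0to1["host_state"],    # but exit back into L0
        "controls":    dict(vmcs_0to1["controls"], **vmcs_1to2["controls"]),
    }

vmcs_0to1 = {"guest_state": "L1-regs", "host_state": "L0-regs",
             "controls": {"ept": True}}
vmcs_1to2 = {"guest_state": "L2-regs", "host_state": "L1-regs",
             "controls": {"intercept_io": True}}
vmcs_0to2 = merge_vmcs(vmcs_0to1, vmcs_1to2)
print(vmcs_0to2["guest_state"], vmcs_0to2["host_state"])
```

The key design point is visible in the merge: the merged structure runs L2's guest state, but every exit still lands in L0, never in L1.&lt;br /&gt;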
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations, from L2 virtual to L2 physical addresses, from L2 physical to L1 physical, and from L1 physical to L0 physical. That is three levels of translation, but the hardware MMU exposes only two page tables via EPT: virtual to guest-physical, and guest-physical to host-physical. The authors compress the three translations onto the two hardware tables, going from start to end in two hops instead of three, using a shadow page table for the virtual machine and shadow-on-EPT. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
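The compression step amounts to composing two translation tables into one (a sketch with toy page numbers of our own choosing, not real mappings): composing the table from L2 guest-physical to L1 physical with the table from L1 physical to L0 physical yields a single table the hardware can walk in one hop.&lt;br /&gt;

```python
# Sketch: compose two address-translation tables into one, the way
# EPT0to2 is built from EPT1to2 and EPT0to1.

def compose(outer, inner):
    """outer maps x to y, inner maps y to z; the result maps x to z."""
    return {page: inner[frame] for page, frame in outer.items()}

ept_1to2 = {0: 4, 1: 7}     # L2 guest-physical to L1 physical (toy values)
ept_0to1 = {4: 12, 7: 9}    # L1 physical to L0 (host) physical
ept_0to2 = compose(ept_1to2, ept_0to1)
print(ept_0to2)             # {0: 12, 1: 9}
```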
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, which know they are running on a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0 devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. The idea with DMA is that each hypervisor, L0 and L1, needs to use an IOMMU to let its virtual machines access the device safely. Since there is only one hardware IOMMU, L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU page tables into the single hardware IOMMU page table, so that L2 programs the device directly and the device&#039;s DMAs go into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How were the micro-optimizations implemented to make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running on the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. The VMCSs are optimized further by copying multiple fields at once: by Intel&#039;s specification, reads and writes must normally be performed using the vmread and vmwrite instructions, which operate on a single field, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
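The copy-only-what-changed idea can be sketched as follows (our own illustration of the trade-off, with invented field names, not the authors&#039; code):&lt;br /&gt;

```python
# Sketch: instead of copying every VMCS field on each L1/L2 transition,
# track which fields are dirty and copy only those, in one bulk memory
# copy rather than field-by-field vmread/vmwrite.

def sync_vmcs(src, dst, dirty):
    """Copy only the modified fields from src to dst; return copy count."""
    copied = 0
    for field in dirty:
        dst[field] = src[field]
        copied += 1
    dirty.clear()               # everything is in sync again
    return copied

shadow = {"rip": 100, "rsp": 200, "cr3": 300}
real   = {"rip": 100, "rsp": 200, "cr3": 300}
dirty  = set()

shadow["rip"] = 104             # the guest hypervisor touched one field
dirty.add("rip")
print(sync_vmcs(shadow, real, dirty))   # 1 field copied, not all 3
```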
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and data sharing within a single machine. It is aimed at programmers and should not have too large an effect on an end user running an application in a nested virtual machine, especially at a shallow nesting depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing OSs that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is also visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of this type that does not modify hardware (there have been decent research designs), we expect to see increased interest in the nested integration model described above. The framework also makes for convenient testing and debugging, because hypervisors can function inconspicuously alongside other nested hypervisors and VMs without being detected. Moreover, the efficiency overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which declines as the authors introduce additional levels of abstraction; the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits that occurs when a nested guest traps, handing control to the lowest-level hypervisor, which may hand the trap off to hypervisors above it before finally returning to the guest. Furthermore, we observed that the paper performs tests at the L2 level, a guest with two hypervisors below it. It might have been useful, for understanding the limits of nesting, to investigate higher levels such as L4 or L5, because it is difficult to predict how the system will react to deep nesting: the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;. This means that some of the techniques the authors use to increase performance are not reproducible on other systems, so the generality of parts of their solution may be limited.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and does a good job of conveying the technical details. It does, however, assume a high level of background knowledge and familiarity with the subject, especially with some of the more technical points of the architecture the hardware uses to implement virtualization. For example, section 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, illustrates the technique with an example using terms such as EPT and L1, which may not be familiar to readers not used to the technical language. On the other hand, the paper touches on a wide range of topics in the field of virtualization, including CPU, memory and I/O device virtualization. This wide scope means that many of the major components of virtualization are discussed, so in the process of understanding the paper one learns a great deal about many parts of the field.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, pages 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6675</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6675"/>
		<updated>2010-12-03T03:02:02Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The style and presentation */ some more explanation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +        &lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of the paper, it&#039;s essential to provide some background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulated copy of the underlying hardware for a guest operating system, program or process to run on. [1] This emulation, usually referred to as a virtual machine, gives the guest the illusion that it&#039;s running on the bare hardware; in reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, and is now associated with a number of areas where the technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes of our assigned paper, we focus on full virtualization of hardware in the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of guest virtual machines. The main task of the hypervisor is to present an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to handle the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|left|400px]]&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, numbered 0 to 3. Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3, while Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest attempts to access privileged hardware components, the hypervisor intervenes and handles the request on the guest&#039;s behalf in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. Portability, however, is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles the hypervisors running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, every trap must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, a trap is triggered that goes down to L0, and L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction then goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run a particular application or OS that is not compatible with the currently running OS as a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
Nested virtualization can also serve security purposes. One common example is virtual honeypots. A honeypot is a hollow program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and watch how the virtual system is attacked or which features are exploited, taking advantage of the fact that these virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines for upgrades or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors to create a single nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. Recently there have also been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which may send it up to the second-level hypervisor, which may in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can bounce between different levels of hypervisor, so one trap instruction multiplies into many.&lt;br /&gt;
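The multiplication of traps can be illustrated with a toy recurrence (the constant k, the number of privileged instructions a handler executes per exit, is an invented illustrative value, not a figure from the paper):

```python
# Why deep nesting is costly without mitigation: handling one exit at
# level n requires the level-(n-1) hypervisor to execute several
# privileged instructions of its own, each of which is itself an exit
# one level further down. With k such instructions per exit, the count
# grows geometrically with depth.

def exits_caused(depth, k=3):
    """Total low-level exits triggered by a single exit at `depth`."""
    if depth == 1:
        return 1            # L1's exit is handled directly by L0
    # The parent's handler performs k privileged ops, each an exit
    # at depth-1, plus the original exit itself.
    return 1 + k * exits_caused(depth - 1, k)

print([exits_caused(d) for d in (1, 2, 3, 4)])  # -> [1, 4, 13, 40]
```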
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, for instance on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency. It largely contains the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer among multiple guest operating systems. Nested virtualization must share these resources among multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to be shared. The paper thus presents a solution to the problem of efficiently multiplexing the CPU, memory, and I/O among multiple virtual operating systems and hypervisors on a system with no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The nonstop evolution of computers invites intricate designs that are virtualized and compatible with cloud computing. The paper contributes to this trend by allowing consumers and users to equip machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, providing grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further developments and ideas on this infrastructure. For example, the paper Accountable Virtual Machines wraps programs around a VM in a particular state, which could be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed along to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is running as a virtual machine and only L0 is using the architectural mode for a hypervisor. To multiplex the hardware, L0 makes L2 run as a virtual machine of L1 by merging VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of L1&#039;s virtual machine to handle it. To handle a single L2 exit, L1 must read and write the VMCS and disable interrupts, which would not normally be a problem, but because L1 runs in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 or L3 exit causes many exits (more exits, less performance). This problem was corrected by making the single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2. This process repeats continuously. -csulliva&lt;br /&gt;
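The VMCS merge at the centre of this flow can be sketched as follows (field names and merge rules are simplified placeholders, not the real VMCS layout):

```python
# Sketch of the VMCS merge in nested VMX: L0 combines the spec it keeps
# for L1 (VMCS0->1) with the spec L1 prepared for L2 (VMCS1->2) into
# VMCS0->2, which the hardware can run directly. Field names and merge
# rules are simplified placeholders.

def merge_vmcs(vmcs01, vmcs12):
    """Build VMCS0->2: guest state comes from L1's spec for L2, while
    host (exit) state must return control to L0, not L1."""
    return {
        "guest_state": vmcs12["guest_state"],  # run L2's register/memory state
        "host_state": vmcs01["host_state"],    # on exit, land back in L0
        # union of the trap conditions both levels care about
        "controls": vmcs01["controls"] | vmcs12["controls"],
    }

vmcs01 = {"guest_state": "L1-regs", "host_state": "L0-entry",
          "controls": {"ept_violation"}}
vmcs12 = {"guest_state": "L2-regs", "host_state": "L1-entry",
          "controls": {"io_exit"}}
vmcs02 = merge_vmcs(vmcs01, vmcs12)
print(vmcs02["guest_state"], vmcs02["host_state"])  # L2-regs L0-entry
```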
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea is that with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. There are three levels of translation, but only two MMU page tables in the hardware, called the EPT, which translate virtual to guest-physical and guest-physical to host-physical. The three translations are compressed onto the two tables, going from beginning to end in two hops instead of three. This is done with a shadow page table for the virtual machine and shadow-on-EPT; shadow-on-EPT compresses the three logical translations into two page tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
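The compression of translations can be sketched as the composition of two mapping tables (plain dictionaries stand in for page tables; the addresses are arbitrary illustrative values):

```python
# Multi-dimensional paging collapses the translation chain by composing
# EPT1->2 with EPT0->1 into EPT0->2, so an L2 guest-physical address
# reaches host-physical memory in one hop instead of two.

def compose(ept12, ept01):
    """EPT0->2[addr] = EPT0->1[EPT1->2[addr]] for every mapped address."""
    return {l2: ept01[l1] for l2, l1 in ept12.items() if l1 in ept01}

ept12 = {0x1000: 0x8000, 0x2000: 0x9000}    # L2-physical -> L1-physical
ept01 = {0x8000: 0x40000, 0x9000: 0x50000}  # L1-physical -> L0-physical
ept02 = compose(ept12, ept01)               # L2-physical -> L0-physical
print(ept02)
```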
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers which know they are talking to a driver (Barham03, Russell08), and direct device assignment (Levasseur04, Yassour08), which gives the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0, L1) needs an IOMMU to let its virtual machine access the device safely. There is only one physical IOMMU, so L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU translations into the single hardware IOMMU page table, so that L2 can program the device directly and the device&#039;s DMAs land in L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How were the micro-optimizations implemented to make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2, and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only if it has been modified, carefully balancing full copying against partial copying and tracking. The VMCSs are optimized further by copying multiple VMCS fields at once: normally, by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
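The bulk-copy idea can be illustrated in miniature (field names and exit counts are invented for illustration; the point is one large copy versus one trap-prone instruction pair per field):

```python
# The bulk-copy optimization in miniature: instead of one vmread/vmwrite
# per VMCS field (each of which traps when executed by L1), copy a whole
# run of fields with a single memory copy. Field list is illustrative.

VMCS_FIELDS = ["rip", "rsp", "cr0", "cr3", "cr4", "efer"]

def copy_per_field(src, dst):
    """Per-field copy: one vmread + one vmwrite per field, each an exit."""
    exits = 0
    for f in VMCS_FIELDS:
        dst[f] = src[f]
        exits += 2
    return exits

def copy_bulk(src, dst):
    """One large memory copy: plain loads/stores, no exits."""
    dst.update({f: src[f] for f in VMCS_FIELDS})
    return 0

src = {f: i for i, f in enumerate(VMCS_FIELDS)}
a, b = {}, {}
print(copy_per_field(src, a), copy_bulk(src, b), a == b)  # 12 0 True
```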
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and data sharing within a single machine. It is aimed at programmers, and should not have too large an effect on an end user running an application in a nested virtual machine. This is especially true if the user is using the system at a low nesting depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing OSes that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It then follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is also visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of this type that does not modify hardware (there have been research designs of varying quality), we expect to see increased interest in the nested integration model described above. The framework makes for convenient testing and debugging, because hypervisors can function inconspicuously beneath other nested hypervisors and VMs without being detected. Moreover, the efficiency overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
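The quoted 6-10% per-level overhead compounds multiplicatively with nesting depth; the following simply works out that arithmetic (8% is picked as a representative value within the quoted range, not a figure measured by the authors):

```python
# If each nesting level adds roughly 6-10% overhead, slowdowns compound
# multiplicatively with depth. 8% is an illustrative mid-range value.

def slowdown(depth, per_level=0.08):
    """Overall slowdown factor after `depth` nesting levels."""
    return (1 + per_level) ** depth

print([round(slowdown(d), 2) for d in (1, 2, 3)])  # -> [1.08, 1.17, 1.26]
```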
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which declines as the authors introduce additional levels of abstraction. The everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits: when a nested guest traps, control passes to the lowest-level hypervisor, which may hand the trap off to hypervisors above it before finally returning to the guest. Furthermore, we observed that the paper performs tests at the L2 level, i.e., a guest with two hypervisors below it. It might have been useful, for understanding the limits of nesting, to investigate higher levels of nesting such as L4 or L5, because it can be difficult to predict how the system will react to deep nesting; the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. It does, however, assume a high level of background knowledge and familiarity with the subject, especially with some of the more technical points of the architecture the hardware uses to implement virtualization. For example, section 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, attempts to illustrate the technique with an example using terms such as EPT and L1, which may not be familiar to readers not used to the technical language. The paper does, however, touch on a wide range of topics in the field of virtualization, including CPU, memory, and I/O device virtualization. This wide scope means that many of the major components of virtualization are discussed, so in the process of understanding the paper one learns a lot about many different parts of the field.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, pages 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6662</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6662"/>
		<updated>2010-12-03T02:36:42Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The pros */ added more explanation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some insight and background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program, or process to operate on. [1] This emulation, usually referred to as a virtual machine, typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used, such as data virtualization, storage virtualization, mobile virtualization, and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, ranging from 0 to 3. Ring 0 is the most privileged level, allowing access to the bare hardware components. The operating system kernel must execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3. Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest attempts to access privileged hardware components, the hypervisor comes into play, handles the access, and gives the guest the illusion that it is running in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to gain some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, through which the guest requests privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles the hypervisors running directly on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
This is the model supported by x86-based systems. Here, everything must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, a trap is triggered that goes down to L0. L0 then sends the result of the requested instruction back to L1. Generally, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run a particular application or OS that is not compatible with the currently running OS as a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
Nested virtualization can also serve security purposes. One common example is virtual honeypots. A honeypot is a hollow program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and watch how the virtual system is attacked or which features are exploited, taking advantage of the fact that these virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines for upgrades or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors to create a single nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. Recently there have also been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which may send it up to the second-level hypervisor, which may in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can bounce between different levels of hypervisor, so one trap instruction multiplies into many.&lt;br /&gt;
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, for instance on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency. It largely contains the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer among multiple guest operating systems. Nested virtualization must share these resources among multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to be shared. The paper thus presents a solution to the problem of efficiently multiplexing the CPU, memory, and I/O among multiple virtual operating systems and hypervisors on a system with no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The nonstop evolution of computers invites intricate designs that are virtualized and compatible with cloud computing. The paper contributes to this trend by allowing consumers and users to equip machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, providing grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further developments and ideas on this infrastructure. For example, the paper Accountable Virtual Machines wraps programs around a VM in a particular state, which could be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine: only L0 runs in the architecture&#039;s hypervisor mode. To multiplex the hardware and make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2. When L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of the L1 virtual machine to handle. While handling a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 is running in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 (or L3) exit causes many exits, and more exits mean less performance. This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
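&lt;br /&gt;
The VMCS merge described above can be sketched as follows (a minimal Python illustration of the idea, not the paper&#039;s code; the field names and values are invented):&lt;br /&gt;
&lt;br /&gt;
```python
def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Sketch of building VMCS0-2 from VMCS0-1 and VMCS1-2: guest-state
    fields describing L2 come from VMCS1-2, exactly as L1 prepared them,
    while host-state fields are taken from VMCS0-1 so that every exit
    returns control to L0, never directly to L1."""
    merged = dict(vmcs_1_2)            # start from the L1 view of L2
    for field, value in vmcs_0_1.items():
        if field.startswith("host_"):
            merged[field] = value      # exits must land in L0
    return merged

vmcs_0_1 = {"guest_rip": 4096, "host_rip": 100, "host_cr3": 200}
vmcs_1_2 = {"guest_rip": 8192, "host_rip": 300, "host_cr3": 400}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
# guest state comes from VMCS1-2, host state from VMCS0-1
```
&lt;br /&gt;
The key point is that the host-state fields always point back to L0, so every L2 exit lands in L0 first, matching the single-level architecture described earlier.&lt;br /&gt;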
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The Turtles project compresses the three translations onto these two tables, going from start to end in two hops instead of three. This is done with a shadow page table for the virtual machine and with shadow-on-EPT; shadow-on-EPT compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page table changes frequently. L0 emulates the EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
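&lt;br /&gt;
The construction of EPT0-&amp;gt;2 can be illustrated with a toy sketch (our own Python model, not the paper&#039;s implementation; page tables are modelled as simple page-number dictionaries):&lt;br /&gt;
&lt;br /&gt;
```python
def compose_ept(ept_1_2, ept_0_1):
    """Build EPT0-2 by composing two translations: each L2 physical page
    is mapped through EPT1-2 to an L1 physical page, then through EPT0-1
    to a machine (L0) page.  Pages missing from either table stay
    unmapped; in the real system such accesses fault and L0 fills the
    combined table lazily."""
    ept_0_2 = {}
    for l2_page, l1_page in ept_1_2.items():
        if l1_page in ept_0_1:
            ept_0_2[l2_page] = ept_0_1[l1_page]
    return ept_0_2

ept_1_2 = {0: 7, 1: 8, 2: 9}    # L2 physical page to L1 physical page
ept_0_1 = {7: 70, 8: 80}        # L1 physical page to machine page
print(compose_ept(ept_1_2, ept_0_1))  # {0: 70, 1: 80}; page 2 unmapped
```
&lt;br /&gt;
In the real system the composed table is built lazily on EPT faults; the sketch just shows the two-hop composition collapsing into one.&lt;br /&gt;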
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, in which the guest driver knows it is running on a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get that performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of the many combinations, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0 and L1) needs to use an IOMMU so that its virtual machine can safely access the device directly. There is only one hardware IOMMU, so L0 emulates an IOMMU for L1, then compresses the multiple IOMMU translations into the single hardware IOMMU page table; as a result, L2 programs the device directly and device DMAs land in L2&#039;s memory space directly.&lt;br /&gt;
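&lt;br /&gt;
The compression of the two IOMMU translations into the single hardware table can be sketched as follows (a hypothetical Python model, not the paper&#039;s code; the page numbers, page size, and helper names are invented):&lt;br /&gt;
&lt;br /&gt;
```python
def build_hw_iommu(iommu_0_1, iommu_1_2):
    """L0 folds the IOMMU table that L1 set up for L2 (iommu_1_2) and
    its own table for L1 (iommu_0_1) into the one hardware IOMMU table,
    so a device programmed by L2 can DMA with L2 addresses and the
    hardware resolves them to machine pages in a single step."""
    hw = {}
    for l2_page, l1_page in iommu_1_2.items():
        if l1_page in iommu_0_1:
            hw[l2_page] = iommu_0_1[l1_page]
    return hw

def device_dma_target(hw_iommu, l2_addr, page_size=4096):
    # a DMA issued with an L2 address goes through the hardware table once
    page, offset = divmod(l2_addr, page_size)
    return hw_iommu[page] * page_size + offset

hw = build_hw_iommu({5: 50}, {3: 5})   # L2 page 3 ends up on machine page 50
print(device_dma_target(hw, 3 * 4096 + 8))  # 204808
```
&lt;br /&gt;
Here the device, programmed by L2 with L2 addresses, reaches the right machine page in a single hardware translation, with no exit to L0 or L1 on the DMA path.&lt;br /&gt;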
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How were the micro-optimizations implemented to make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. The VMCSs are optimized further by copying multiple fields at once: normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field. In practice, VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones the authors tested). The main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at programmers, and should not have too large an effect on an end user running an application in a nested virtual machine. This is especially true if the user is using the system at a low nesting depth. One can further argue that the most common use cases for nested virtualization that the authors mention in section 1, such as virtualizing operating systems that are already hypervisors (like Windows 7) and hypervisors in the cloud, will be at a shallow depth. It then follows that the testing the authors do in section 4 covers the most common use cases, so users can expect similarly impressive performance. The contribution is especially visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of this type that does not modify the hardware (earlier attempts were research designs), we expect to see increased interest in the nested integration model described above. The framework is convenient for testing and debugging because hypervisors can run underneath other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce additional levels of abstraction; the everlasting trade-off between flexibility and efficiency continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits that occurs when a nested guest traps, handing control to the lowest-level hypervisor, which may hand the trap off to hypervisors above it before control finally returns to the guest. Furthermore, we observed that the paper performs its tests at the L2 level, i.e. a guest with two hypervisors below it. It might have been useful for understanding the limits of nesting if the authors had investigated higher levels of nesting, such as L4 or L5, because it is difficult to predict how the system will react to deep nesting: the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge, it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, illustrates the technique with an example that relies on terms such as EPT and L1. All in all, the highly in-depth video that accompanies the paper greatly increased my awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient nested x86 virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6654</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6654"/>
		<updated>2010-12-03T02:24:06Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The cons */ some more explanation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some insight and background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation consists of a virtualized environment managed by a hypervisor, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, we&#039;re actually running the virtual machine as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls the host&#039;s resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|left|400px]]&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems, there are four levels of access privilege, called rings, ranging from 0 to 3. Ring 0 is the most privileged level, allowing access to the bare hardware components. The operating system kernel must execute in Ring 0 in order to access the hardware and keep control. User programs execute in Ring 3. Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0, while the guest virtual machine normally executes in Ring 3. When the guest attempts to access privileged hardware components, the hypervisor comes into play, handles the situation, and performs the privileged operation in Ring 0 on the guest&#039;s behalf.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization we discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface that must be implemented in the guest kernel so that it can get some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Another thing to note is that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute higher-level instructions or access privileged hardware components, it triggers a trap or fault that gets caught by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles the hypervisors running directly on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, this triggers a trap that goes down to L0, and L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run a particular application or OS that&#039;s not compatible with the existing or running OS as a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example of this is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well known example of an IAAS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their API and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is basically a hollow program or network that appears to be functioning to outside users but, in reality, is only there as a security tool to watch or trap hacker attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and see how our virtual system is being attacked or what kind of features are being exploited. We can take advantage of the fact that those virtual honeypots can easily be controlled, manipulated, destroyed or even restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in live migration or transfer of virtual machines in cases of upgrade or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMWare and VirtualBox have adapted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid 1970s [4]. Early research in the area assumed that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these solutions suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between the different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions.&lt;br /&gt;
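&lt;br /&gt;
As a rough illustration of this blow-up, the following toy model (our own sketch; the function and its per-level cost are invented, not taken from the paper) counts how many hardware exits to L0 are needed to handle one trap as the nesting depth grows, assuming each guest-mode hypervisor executes a fixed number of trapping privileged instructions per exit it handles:&lt;br /&gt;
&lt;br /&gt;
```python
def exits_to_handle(depth, priv_ops=10):
    """Toy model: hardware exits to L0 needed to fully handle one trap
    taken by a guest at nesting level depth.  Assumes each hypervisor
    running in guest mode executes priv_ops privileged instructions per
    exit it handles, and each of those instructions itself traps and
    must be handled one level further down."""
    if depth == 1:
        return 1  # a trap taken by L1 is handled directly by L0
    return priv_ops * exits_to_handle(depth - 1, priv_ops)

# exit counts for guests at depths 1 through 4 in this model
counts = [exits_to_handle(d) for d in range(1, 5)]
print(counts)  # [1, 10, 100, 1000]
```
&lt;br /&gt;
In this model one L1 trap costs one exit, while each additional nesting level multiplies the count, which is why making individual exits cheap and reducing their frequency matters so much.&lt;br /&gt;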
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist (as on x86 processors). Solutions that do not require such support suffer significant performance costs, because the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support for nesting on commodity hardware with efficiency. It largely contains the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources that need to be shared. Putting this together, the paper presents a solution to the problem of how to multiplex the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for both security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, give programmers a foundation for further development and for ideas that build on this infrastructure. For example, the paper Accountable Virtual Machines wraps programs in a VM whose state is recorded; such a VM could naturally be placed on a separate hypervisor for stronger isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine: only L0 runs in the architecture&#039;s hypervisor mode. To multiplex the hardware and make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2. When L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of the L1 virtual machine to handle. While handling a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 is running in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 (or L3) exit causes many exits, and more exits mean less performance. This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
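&lt;br /&gt;
The VMCS merge described above can be sketched as follows (a minimal Python illustration of the idea, not the paper&#039;s code; the field names and values are invented):&lt;br /&gt;
&lt;br /&gt;
```python
def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Sketch of building VMCS0-2 from VMCS0-1 and VMCS1-2: guest-state
    fields describing L2 come from VMCS1-2, exactly as L1 prepared them,
    while host-state fields are taken from VMCS0-1 so that every exit
    returns control to L0, never directly to L1."""
    merged = dict(vmcs_1_2)            # start from the L1 view of L2
    for field, value in vmcs_0_1.items():
        if field.startswith("host_"):
            merged[field] = value      # exits must land in L0
    return merged

vmcs_0_1 = {"guest_rip": 4096, "host_rip": 100, "host_cr3": 200}
vmcs_1_2 = {"guest_rip": 8192, "host_rip": 300, "host_cr3": 400}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
# guest state comes from VMCS1-2, host state from VMCS0-1
```
&lt;br /&gt;
The key point is that the host-state fields always point back to L0, so every L2 exit lands in L0 first, matching the single-level architecture described earlier.&lt;br /&gt;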
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The Turtles project compresses the three translations onto these two tables, going from start to end in two hops instead of three. This is done with a shadow page table for the virtual machine and with shadow-on-EPT; shadow-on-EPT compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page table changes frequently. L0 emulates the EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
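&lt;br /&gt;
The construction of EPT0-&amp;gt;2 can be illustrated with a toy sketch (our own Python model, not the paper&#039;s implementation; page tables are modelled as simple page-number dictionaries):&lt;br /&gt;
&lt;br /&gt;
```python
def compose_ept(ept_1_2, ept_0_1):
    """Build EPT0-2 by composing two translations: each L2 physical page
    is mapped through EPT1-2 to an L1 physical page, then through EPT0-1
    to a machine (L0) page.  Pages missing from either table stay
    unmapped; in the real system such accesses fault and L0 fills the
    combined table lazily."""
    ept_0_2 = {}
    for l2_page, l1_page in ept_1_2.items():
        if l1_page in ept_0_1:
            ept_0_2[l2_page] = ept_0_1[l1_page]
    return ept_0_2

ept_1_2 = {0: 7, 1: 8, 2: 9}    # L2 physical page to L1 physical page
ept_0_1 = {7: 70, 8: 80}        # L1 physical page to machine page
print(compose_ept(ept_1_2, ept_0_1))  # {0: 70, 1: 80}; page 2 unmapped
```
&lt;br /&gt;
In the real system the composed table is built lazily on EPT faults; the sketch just shows the two-hop composition collapsing into one.&lt;br /&gt;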
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, in which the guest driver knows it is running on a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get that performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of the many combinations, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0 and L1) needs to use an IOMMU so that its virtual machine can safely access the device directly. There is only one hardware IOMMU, so L0 emulates an IOMMU for L1, then compresses the multiple IOMMU translations into the single hardware IOMMU page table; as a result, L2 programs the device directly and device DMAs land in L2&#039;s memory space directly.&lt;br /&gt;
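&lt;br /&gt;
The compression of the two IOMMU translations into the single hardware table can be sketched as follows (a hypothetical Python model, not the paper&#039;s code; the page numbers, page size, and helper names are invented):&lt;br /&gt;
&lt;br /&gt;
```python
def build_hw_iommu(iommu_0_1, iommu_1_2):
    """L0 folds the IOMMU table that L1 set up for L2 (iommu_1_2) and
    its own table for L1 (iommu_0_1) into the one hardware IOMMU table,
    so a device programmed by L2 can DMA with L2 addresses and the
    hardware resolves them to machine pages in a single step."""
    hw = {}
    for l2_page, l1_page in iommu_1_2.items():
        if l1_page in iommu_0_1:
            hw[l2_page] = iommu_0_1[l1_page]
    return hw

def device_dma_target(hw_iommu, l2_addr, page_size=4096):
    # a DMA issued with an L2 address goes through the hardware table once
    page, offset = divmod(l2_addr, page_size)
    return hw_iommu[page] * page_size + offset

hw = build_hw_iommu({5: 50}, {3: 5})   # L2 page 3 ends up on machine page 50
print(device_dma_target(hw, 3 * 4096 + 8))  # 204808
```
&lt;br /&gt;
Here the device, programmed by L2 with L2 addresses, reaches the right machine page in a single hardware translation, with no exit to L0 or L1 on the DMA path.&lt;br /&gt;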
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How were the micro-optimizations implemented to make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. The VMCSs are optimized further by copying multiple fields at once: normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field. In practice, VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones the authors tested). The main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at programmers; an end user running applications on top of this architecture should see no clearly detectable deviation. The contribution is especially visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of this type that does not modify the hardware (earlier attempts were research designs), we expect to see increased interest in the nested integration model described above. The framework is convenient for testing and debugging because hypervisors can run underneath other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce additional levels of abstraction; the everlasting trade-off between flexibility and efficiency continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits that occurs when a nested guest traps, handing control to the lowest-level hypervisor, which may hand the trap off to hypervisors above it before control finally returns to the guest. Furthermore, we observed that the paper performs its tests at the L2 level, i.e. a guest with two hypervisors below it. It might have been useful for understanding the limits of nesting if the authors had investigated higher levels of nesting, such as L4 or L5, because it is difficult to predict how the system will react to deep nesting: the increase in the number of traps and other performance-killing problems can potentially be exponential as the nesting gets deeper. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate and precise description of the concept of nested virtualization, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example that assumes familiarity with terms such as EPT and L1. All in all, the accompanying video greatly deepened my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. (1973). [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6645</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6645"/>
		<updated>2010-12-03T02:11:56Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The cons */ adding another sentence of explanation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to the discussion page for group member confirmation, general talk and paper discussion.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it is essential that we provide some insight and background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] This emulation, usually referred to as a virtual machine, consists of a virtualized environment managed by a hypervisor, giving the guest operating system the illusion that it is running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines, and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface implemented in the guest kernel that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer context switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, do not support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
In this model, every parent hypervisor handles the traps of the hypervisors and guests running directly on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, a trap is triggered that goes down to L0. L0 then sends the result of the requested instruction back to L1. Generally, a trap at level Ln will be handled by the host hypervisor at level L0, and the resulting emulated instruction then goes back to Ln.&lt;br /&gt;
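As a hypothetical illustration (not code from the paper or from any real hypervisor), the routing difference between the two hardware-support models can be sketched as follows, reducing each model to the path a trap takes.&lt;br /&gt;

```python
# Hypothetical sketch of trap routing. Levels run from L0 (host hypervisor)
# up to Ln (the most deeply nested guest).

def multi_level_handler(n):
    """Multiple-level model: a trap at Ln is handled by its direct parent."""
    return n - 1

def single_level_path(n):
    """Single-level model (x86): the trap first lands at L0, which may then
    involve every intermediate hypervisor before the result returns to Ln."""
    return list(range(0, n))  # L0, L1, ..., L(n-1)

print(multi_level_handler(3))  # 2: L2 handles a trap from L3
print(single_level_path(3))    # [0, 1, 2]: everything goes through L0 first
```

The single-level path is why exits multiply with depth: each hop in the list can itself execute privileged instructions that trap again.&lt;br /&gt;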
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run a particular application or OS that is not compatible with the currently running OS as a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example of this is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to allow customers to host their own preferred, user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites can host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is basically a hollow program or network that appears functional to outside users but, in reality, exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create honeypot copies of our system as virtual machines and see how our virtual system is attacked and what kinds of features are exploited, taking advantage of the fact that these virtual honeypots can easily be controlled, manipulated, destroyed or even restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, because we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to having nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until, in the worst case, it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so that one trap instruction multiplies into many trap instructions.&lt;br /&gt;
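The multiplication of traps can be captured in a toy model; the branching factor k below is an assumed illustrative number, not a measurement from the paper.&lt;br /&gt;

```python
# Toy model (assumption for illustration): handling one exit at nesting
# depth d requires k privileged operations in the hypervisor below, each of
# which traps again, so hardware exits grow exponentially with depth.

def total_exits(depth, k):
    """Hardware exits caused by a single guest exit at the given depth."""
    if depth == 1:
        return 1  # an L1 exit is handled directly by L0 in hardware
    return k * total_exits(depth - 1, k)

for d in (1, 2, 3):
    print(d, total_exits(d, 10))  # 1, 10, 100: exponential growth in d
```

This is why containing the expansion of a single nested trap, as the paper does, matters more and more as nesting gets deeper.&lt;br /&gt;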
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, for instance on x86 processors. Solutions that do not require such support suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency. It is for the most part able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory and I/O devices as the three key resources that need to be shared. Putting this together, the paper presents a solution to the problem of how to multiplex the CPU, memory and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The non-stop evolution of computers invites intricate designs that are virtualized and harmonious with cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to pursue further development and ideas on top of this infrastructure. For example, the Accountable Virtual Machines paper wraps programs around a particular VM state, which could very well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is passed to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. The vmlaunch instruction traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and the architecture supports only a single level of hardware virtualization. To multiplex the hardware so that L2 runs as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is combined with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. When L2 subsequently traps, L0 either handles the trap itself or forwards it to L1, depending on whether it falls under L1&#039;s responsibility. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts; these operations would not normally be a problem, but because L1 is running in guest mode, each of them traps, so a single high-level L2 exit causes many additional exits (and more exits mean less performance). The authors correct this problem by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L0 or L1 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously. -csulliva
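A minimal sketch of the VMCS merge, with a Python dictionary standing in for the hardware structure; the field names and merge rules are simplifying assumptions, not the real VMX layout.&lt;br /&gt;

```python
# Hypothetical VMCS merge: VMCS0to1 describes how L0 runs L1, VMCS1to2
# describes how L1 wants to run L2; merging them yields VMCS0to2 so that
# L0 can run L2 directly on the hardware.

def merge_vmcs(vmcs_0to1, vmcs_1to2):
    return {
        # L2 register state comes from what L1 specified for its guest
        "guest_state": vmcs_1to2["guest_state"],
        # exits must always land back in L0, never in L1
        "host_state": vmcs_0to1["host_state"],
        # intercept everything that either hypervisor asked to intercept
        "exit_controls": sorted(
            set(vmcs_0to1["exit_controls"]).union(vmcs_1to2["exit_controls"])
        ),
    }

vmcs_0to1 = {"guest_state": {"rip": "L1_entry"},
             "host_state": {"rip": "L0_handler"},
             "exit_controls": ["EXTERNAL_INTERRUPT", "EPT_VIOLATION"]}
vmcs_1to2 = {"guest_state": {"rip": "L2_entry"},
             "host_state": {"rip": "L1_handler"},  # ignored by the merge
             "exit_controls": ["EXTERNAL_INTERRUPT", "IO_INSTRUCTION"]}

vmcs_0to2 = merge_vmcs(vmcs_0to1, vmcs_1to2)
print(vmcs_0to2["guest_state"]["rip"])  # L2_entry
print(vmcs_0to2["host_state"]["rip"])   # L0_handler
```

The key design point the sketch illustrates is that L1&#039;s host state is deliberately discarded: every exit goes to L0, which forwards to L1 only when needed.&lt;br /&gt;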
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea is that with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, yet the hardware MMU exposes only two page tables via EPT: one from virtual to guest-physical and one from guest-physical to host-physical. The three translations therefore have to be compressed onto the two tables, going from start to end in two hops instead of three. One way is a shadow page table for the virtual machine on top of EPT (shadow-on-EPT), which compresses the three logical translations into two tables. However, the guest page table changes frequently while the EPT tables rarely change, so instead L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
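The construction of EPT0-&amp;gt;2 from the two underlying tables amounts to composing two mappings; the page numbers below are made-up illustrative values, not real page-table entries.&lt;br /&gt;

```python
# Hypothetical composition of nested translation tables: EPT1to2 maps
# L2-physical pages to L1-physical pages, EPT0to1 maps L1-physical pages
# to L0-physical pages; composing them gives the direct table EPT0to2.

def compose(ept_1to2, ept_0to1):
    """Build the L2-to-L0 mapping so translation takes one hop, not two."""
    ept_0to2 = {}
    for l2_page, l1_page in ept_1to2.items():
        if l1_page in ept_0to1:  # only pages that L0 has actually backed
            ept_0to2[l2_page] = ept_0to1[l1_page]
    return ept_0to2

ept_1to2 = {0: 7, 1: 3}    # L2 physical page to L1 physical page
ept_0to1 = {7: 42, 3: 13}  # L1 physical page to L0 physical page
print(compose(ept_1to2, ept_0to1))  # {0: 42, 1: 13}
```

Because the inputs rarely change, the composed table also rarely needs rebuilding, which is where the reduction in exits comes from.&lt;br /&gt;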
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the guest knows it is running on a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which yields the best performance. To get the best performance the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, the authors used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O and programmed I/O together with DMA and interrupts. The idea with DMA is that each hypervisor, L0 and L1, needs to use an IOMMU to let its virtual machines safely access a device. Since the platform has only one hardware IOMMU, L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU page tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device&#039;s DMAs go into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did the authors implement the micro-optimizations that make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were confined to L0. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 followed by an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. The VMCSs are optimized further by copying multiple fields at once: normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones the authors tested). The main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
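The gain from bypassing vmread/vmwrite can be sketched as follows; the trap counting is an illustrative assumption, standing in for the fact that in a nested setup each single-field VMCS access by L1 traps to L0.&lt;br /&gt;

```python
# Hypothetical sketch: per-field vmread versus one bulk memory copy of the
# VMCS. In a nested guest every vmread/vmwrite traps, so the trap count is
# exactly what the (processor-specific) bulk-copy optimization removes.

trap_count = 0

def vmread(vmcs, field):
    """Single-field access: each call costs one trap to L0 when nested."""
    global trap_count
    trap_count += 1
    return vmcs[field]

def bulk_copy(vmcs):
    """One large memory copy of the whole region: no per-field traps.
    Bypasses the VMX specification, hence only safe on tested CPUs."""
    return dict(vmcs)

vmcs = {"guest_rip": 4096, "guest_rsp": 8192, "exit_reason": 30}
per_field = {f: vmread(vmcs, f) for f in vmcs}
print(trap_count)                    # 3: one trap per field
print(bulk_copy(vmcs) == per_field)  # True: same data, zero traps
```

A real VMCS has dozens of fields, so eliminating one trap per field while L1 handles a single L2 exit is a substantial saving.&lt;br /&gt;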
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The pros ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers; end users should notice no clearly detectable difference when using applications on top of this architecture. The contribution is nevertheless visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of its kind that does not modify the hardware (earlier attempts were research designs of mixed quality), we expect to see increased interest in the nested virtualization model described above. The framework is convenient for testing and debugging because hypervisors can run beneath other nested hypervisors and VMs without being detected. Moreover, the performance overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The cons ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is performance, which suffers as the authors introduce an additional level of abstraction; the everlasting trade-off between features and efficiency continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits: when a nested guest traps, control passes to the lowest-level hypervisor, which may hand the trap off to the hypervisors above it before control finally returns to the guest. Furthermore, the paper only performs tests at the L2 level, that is, on a guest with two hypervisors below it. Investigating deeper levels of nesting, such as L4 or L5, just to see what the effect is, might have been useful for understanding the limits of nesting. Another significant detriment is that the paper relies on optimizations, such as avoiding vmread/vmwrite operations, that are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style and presentation ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate and precise description of the concept of nested virtualization, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example that assumes familiarity with terms such as EPT and L1. All in all, the accompanying video greatly deepened my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. (1973). [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6629</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6629"/>
		<updated>2010-12-03T01:53:35Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* References */ adding reference&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to the discussion page for group member confirmation, general talk and paper discussion.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it is essential that we provide some insight and background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] This emulation, usually referred to as a virtual machine, consists of a virtualized environment managed by a hypervisor, giving the guest operating system the illusion that it is running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines, and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface implemented in the guest kernel that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer context switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, do not support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, numbered 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must &lt;br /&gt;
execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3. Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. While the guest virtual machine normally executes in Ring 3, when the guest triggers a trap and the trap is handled by the host hypervisor, the guest hypervisor&#039;s privileged operations are effectively carried out in Ring 0 on its behalf.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every other hypervisor running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, this triggers a trap that goes down to L0. L0 then sends the result of the requested instruction back to L1. In general, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
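The single-level flow can be sketched with a toy model (purely illustrative; the function and its output are our own simplification, not code from the paper): every trap, no matter how deeply nested its source, is first delivered to L0, and the emulated result is then passed back up through the parent hypervisors.&lt;br /&gt;

```python
# Toy model of single-level (x86-style) trap handling: a trap raised at
# any nesting level Ln is always delivered down to L0, which emulates the
# instruction; the result is then returned up through Ln's ancestors.
def handle_trap(level):
    """Return the list of hypervisor levels involved in one trap from Ln."""
    path = [0]                      # the trap goes straight down to L0 ...
    path.extend(range(1, level))    # ... and the result climbs back to Ln's parent
    return path

# A trap in L3 (a VM nested three levels deep) involves L0, then L1 and L2:
print(handle_trap(3))   # [0, 1, 2]
```

A trap from L1 involves only L0 itself, matching the non-nested case.&lt;br /&gt;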
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also offer the user a compatibility mode for other operating systems or applications; an example of this is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer gains the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and websites can host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a decoy program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and watch how the virtual system is attacked and which features are exploited, taking advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines for upgrades or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions. &lt;br /&gt;
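How quickly traps multiply can be made concrete with a small recursive count (a hypothetical model with made-up numbers, not a measurement from the paper): suppose handling one exit at level Ln itself requires k privileged operations, each of which traps one level further down.&lt;br /&gt;

```python
# Hypothetical model of trap multiplication in nested virtualization:
# handling one exit at level n requires k privileged operations, and
# each of those traps must itself be handled one level further down.
def exits_caused(level, k=3):
    """Total number of exits triggered by one trap at nesting level `level`."""
    if level == 0:
        return 0            # L0 runs on bare metal: no exit needed
    # one exit for the trap itself, plus the exits caused by the k
    # privileged operations its handler executes one level down
    return 1 + k * exits_caused(level - 1, k)

for n in range(1, 5):
    print(n, exits_caused(n))   # 1 1, 2 4, 3 13, 4 40
```

Even with a modest k, the count grows geometrically with depth, which is exactly why containing this blow-up matters.&lt;br /&gt;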
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful because this support does not always exist, for example on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency. It is for the most part able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources that need to be shared. In sum, the paper presents a solution to the problem of how to multiplex the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs around a VM in a particular state, which could be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed along to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine: only L0 is using the architecture&#039;s hypervisor mode. To multiplex the hardware, L2 must effectively run as a virtual machine of L1, so L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 now launches L2; when L2 causes a trap, L0 handles it itself or forwards it to L1, depending on whether it is the responsibility of L1&#039;s virtual machine to handle. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which wouldn&#039;t normally be a problem, but because L1 runs in guest mode as a virtual machine, all of those operations trap, so a single high-level L2 (or L3) exit causes many exits (more exits, less performance). This problem was addressed by making the single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2. This process repeats continuously. -csulliva&lt;br /&gt;
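The VMCS merge can be sketched with plain dictionaries standing in for the real structures (field names here are illustrative, not Intel&#039;s): the merged VMCS keeps L0 as the real host while adopting L2 as the real guest.&lt;br /&gt;

```python
# Sketch of the VMCS merge described above, using dicts in place of real
# VMCS structures (the field names are our own, not Intel's encodings).
# VMCS0->1 carries L0's view of L1; VMCS1->2 carries L1's view of L2.
vmcs_0_1 = {"host_state": "L0 regs", "guest_state": "L1 regs"}
vmcs_1_2 = {"host_state": "L1 regs", "guest_state": "L2 regs"}

def merge_vmcs(parent, child):
    """Build VMCS0->2: L0 stays the real host, L2 becomes the real guest."""
    return {
        "host_state": parent["host_state"],   # exits return control to L0 ...
        "guest_state": child["guest_state"],  # ... but the CPU actually runs L2
    }

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2)   # {'host_state': 'L0 regs', 'guest_state': 'L2 regs'}
```

The real merge also rewrites control fields and address translations, but the host-from-parent, guest-from-child shape is the core idea.&lt;br /&gt;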
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? With n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU has only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The three translations are compressed onto the two tables, going from start to end in two hops instead of three. This is done with a shadow page table for the virtual machine and with shadow-on-EPT, which compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
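The compression of two translations into one table is just function composition, which a tiny sketch can show (toy page numbers and dicts, not real page-table entries):&lt;br /&gt;

```python
# Sketch of compressing two address translations into one, as in
# multi-dimensional paging: EPT1->2 maps L2-physical to L1-physical pages
# and EPT0->1 maps L1-physical to L0-physical pages; composing them yields
# EPT0->2, so one hardware walk replaces two. (Toy numbers, not real PTEs.)
ept_1_2 = {0: 7, 1: 3}   # L2 physical page -> L1 physical page
ept_0_1 = {7: 42, 3: 9}  # L1 physical page -> L0 physical page

def compose(outer, inner):
    """Build the one-hop table: apply `inner` first, then `outer`."""
    return {page: outer[mid] for page, mid in inner.items()}

ept_0_2 = compose(ept_0_1, ept_1_2)
print(ept_0_2)   # {0: 42, 1: 9}
```

When L1 updates EPT1-&amp;gt;2, L0 only needs to recompute the affected entries of the composed table, which is cheap because the EPT side changes rarely.&lt;br /&gt;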
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation [Sugerman01], para-virtualized drivers that are aware they run on a hypervisor [Barham03, Russell08], and direct device assignment [Levasseur04, Yassour08], which yields the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O and programmed I/O together with DMA and interrupts. For DMA, each hypervisor (L0 and L1) needs an IOMMU so that its virtual machine can safely access the device directly. There is only one hardware IOMMU, so L0 emulates an IOMMU for L1; L0 then compresses the multiple IOMMU translations into a single hardware IOMMU page table so that L2 can program the device directly, and the device&#039;s DMAs go into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did they implement the micro-optimizations that make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest on a bare-metal hypervisor are the transitions between L1 and L2, and the exit-handling code running on the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. They optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling is optimized further by copying multiple VMCS fields at once: by Intel&#039;s specification, reads and writes must be performed with the vmread and vmwrite instructions, which operate on a single field, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of the slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
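The copy-only-modified-fields idea can be sketched as follows (a toy model with invented field names; real VMCS synchronization is far more involved): a naive approach pays one trapping vmwrite per field, while tracking dirty fields and merging them in one bulk pass avoids the per-field traps.&lt;br /&gt;

```python
# Toy sketch of the VMCS-copy optimizations above (field names invented):
# copy only fields that were modified, and copy them in one bulk pass
# instead of one trapping vmread/vmwrite per field.
shadow = {"rip": 0, "rsp": 0, "cr3": 0}   # L1's working copy of the VMCS
real   = {"rip": 0, "rsp": 0, "cr3": 0}   # the VMCS L0 hands to the CPU

def sync(shadow, real, dirty):
    """Merge only the dirty fields; return (exits a per-field vmwrite
    approach would cost, bulk copies actually performed)."""
    for field in dirty:
        real[field] = shadow[field]        # one bulk pass, no trapping per field
    return len(dirty), 1

shadow["rip"] = 0x1000
naive_exits, bulk_copies = sync(shadow, real, dirty={"rip"})
print(naive_exits, bulk_copies, real["rip"])   # 1 1 4096
```

With many dirty fields the gap widens: the per-field cost grows with the number of fields, while the bulk copy stays a single operation.&lt;br /&gt;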
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers and introduces no clearly detectable deviation for end users running applications on top of this architecture. Its contribution is nevertheless visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first successful implementation of this type that does not modify hardware (there have been research designs of varying quality), we expect to see increased interest in the nested integration model described above. The framework is convenient for testing and debugging because hypervisors can function inconspicuously alongside other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is the efficiency cost that appears as the authors introduce an additional level of abstraction; the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by exits that multiply with nesting depth. Furthermore, we observed that the paper performs tests only at the L2 level, i.e., a guest with two hypervisors below it. It might have been useful for understanding the limits of nesting to investigate higher levels such as L4 or L5, just to see what the effect is. Another significant detriment is that optimizations such as avoiding vmread/vmwrite operations are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate, very specific description of the concept of nested virtualization, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, paragraph 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, illustrates the technique with an example using terms such as EPT and L1. All in all, the highly in-depth video provided greatly increased my awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. It also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, 1973, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;br /&gt;
&lt;br /&gt;
[6] Presentation by Joanna Rutkowska, Black Hat Briefings 2006.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6628</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6628"/>
		<updated>2010-12-03T01:52:44Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The good */ Added a few lines about security research&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some insight and background into the concepts &lt;br /&gt;
and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program, or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware, when in reality the virtual machine is running as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to gain some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead, it relies on a software interface implemented in the guest kernel that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, numbered 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must &lt;br /&gt;
execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3. Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. While the guest virtual machine normally executes in Ring 3, when the guest triggers a trap and the trap is handled by the host hypervisor, the guest hypervisor&#039;s privileged operations are effectively carried out in Ring 0 on its behalf.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every other hypervisor running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, this triggers a trap that goes down to L0. L0 then sends the result of the requested instruction back to L1. In general, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also offer the user a compatibility mode for other operating systems or applications; an example of this is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer gains the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and websites can host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a decoy program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and watch how the virtual system is attacked and which features are exploited, taking advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines for upgrades or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions. &lt;br /&gt;
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful because this support does not always exist, for example on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency. It is for the most part able to contain the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources that need to be shared. In sum, the paper presents a solution to the problem of how to multiplex the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for both security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs in a VM with a recorded state, which could well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine: only L0 uses the architectural mode for a hypervisor. To make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2. When L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of L1 to handle. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 runs in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 exit (or L3 exit) causes many exits (more exits, less performance). The authors corrected this problem by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end L1 or L0, depending on the trap, finishes handling it and resumes L2. This process repeats continuously. -csulliva&lt;br /&gt;
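The VMCS merge at the heart of this scheme can be sketched as follows; the field names and dictionary layout are hypothetical stand-ins for the real VMCS encodings:&lt;br /&gt;

```python
def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Sketch of how L0 could fold VMCS0->1 and VMCS1->2 into VMCS0->2.

    Fields are illustrative dictionaries, not Intel's real VMCS
    encodings. The key idea: L2 runs with the guest state that L1
    prepared, but every exit must return control to L0, so the host
    state comes from L0's own VMCS; the control bits trap on anything
    either hypervisor wants intercepted.
    """
    return {
        "guest_state": dict(vmcs_1_2["guest_state"]),  # state of L2, set up by L1
        "host_state": dict(vmcs_0_1["host_state"]),    # exits land in L0
        "intercepts": vmcs_0_1["intercepts"] | vmcs_1_2["intercepts"],
    }
```

L2 thus runs with the guest state L1 intended, while every exit still returns control to L0, which decides whether to forward it to L1.&lt;br /&gt;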
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to guest physical) and the EPT (guest physical to host physical). The three translations must therefore be compressed onto the two hardware tables, going from start to end in two hops instead of three. One way is a shadow page table for the virtual machine combined with EPT (shadow-on-EPT), which compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
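The compression of the two EPT levels into one can be sketched as a table composition; real EPTs are multi-level trees with permission bits, but plain page-number dictionaries (an assumption of this sketch) show the idea:&lt;br /&gt;

```python
def compress_ept(ept_1_2, ept_0_1):
    """Fold EPT1->2 (L2-physical page -> L1-physical page) through
    EPT0->1 (L1-physical -> L0-physical) into EPT0->2, so the MMU can
    translate L2 addresses in one hop. Tables are plain page-number
    dicts; an L1 page missing from EPT0->1 would fault to L0 in the
    real system and is simply skipped here."""
    ept_0_2 = {}
    for l2_page, l1_page in ept_1_2.items():
        if l1_page in ept_0_1:
            ept_0_2[l2_page] = ept_0_1[l1_page]
    return ept_0_2
```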
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the guest knows it is talking to a virtual driver (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0 devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0 and L1) needs to use an IOMMU to let its virtual machine safely access the device directly. There is only one platform IOMMU, so L0 needs to emulate an IOMMU for L1. L0 then compresses the multiple IOMMU translation tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device can DMA into L2 memory space directly.&lt;br /&gt;
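The IOMMU compression is the same composition idea as multi-dimensional paging, applied to DMA addresses; again the tables are modeled as hypothetical page-number dictionaries:&lt;br /&gt;

```python
def build_hw_iommu_table(iommu_1_2, ept_0_1):
    """L0 folds the IOMMU mappings L1 programmed for its guest
    (device DMA page -> L1-physical page) through L1's own memory map
    (L1-physical -> L0-physical) into the one table loaded into the
    real hardware IOMMU, so the device DMAs straight into L2 memory.
    Page-number dicts, illustrative only; unmapped pages are skipped."""
    hw_table = {}
    for dma_page, l1_page in iommu_1_2.items():
        if l1_page in ept_0_1:
            hw_table[dma_page] = ept_0_1[l1_page]
    return hw_table
```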
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did they implement the micro-optimizations that speed up the Turtles project? The two main places where the guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only if it has been modified, carefully balancing full copying against partial copying plus tracking. VMCS handling is optimized further by copying multiple VMCS fields at once. Normally, by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time. VMCS data can instead be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
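The difference between per-field and bulk VMCS copying can be sketched by counting trapping instructions; the trap counts are a simplified model, not measurements:&lt;br /&gt;

```python
def copy_vmcs_per_field(src, dst, fields):
    """Baseline: one vmread plus one vmwrite per field. When L1 runs as
    a guest, each such instruction traps to L0; the function returns
    the number of trapping instructions for comparison."""
    traps = 0
    for field in fields:
        dst[field] = src[field]  # stands in for: vmread field; vmwrite field
        traps += 2               # both instructions exit to L0
    return traps

def copy_vmcs_bulk(src, dst, fields):
    """Optimization sketch: treat the VMCS region as ordinary memory and
    copy every field with one large memory copy, so no trapping
    instructions run. As the paper warns, this bypasses the VMX
    specification and may not work on untested processors."""
    dst.update({field: src[field] for field in fields})
    return 0
```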
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper clearly demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers and does not change the end-user experience of applications running on top of this architecture in any clearly detectable way. Nevertheless, the contribution is visible with respect to security and compatibility. On the security side, this nested virtualization technique can be used to study hypervisor-level rootkits, such as Blue Pill [6], by hosting an infected hypervisor as a guest on top of another hypervisor. Since this is the first efficient implementation of this type that does not modify the hardware (there have been earlier, only partially successful research designs), we expect to see increased interest in the nested virtualization model described above. The framework is convenient for testing and debugging because hypervisors can run underneath other nested hypervisors and VMs without being detected by them. Moreover, the performance overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction. The everlasting performance/flexibility trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits. Furthermore, the paper only performs tests at the L2 level, a guest with two hypervisors below it. It would have been useful for understanding the limits of nesting if the authors had investigated deeper nesting, such as L4 or L5, just to see what the effect is. Another significant limitation concerns optimizations such as the avoidance of vmread/vmwrite operations, which are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, paragraph 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example that relies on terms such as EPT and L1. All in all, the accompanying video greatly increased my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, pages 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6622</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6622"/>
		<updated>2010-12-03T01:40:12Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* The good */ some minor grammar stuff&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it is essential that we provide some insight into and background on the concepts &lt;br /&gt;
and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization means creating an emulation of the underlying hardware for a guest operating system, program, or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a hypervisor and a virtualized environment, giving the guest operating system the illusion that it is running on the bare hardware. In reality, however, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|left|400px]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to gain some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; rather, it relies on a software interface that must be implemented in the guest kernel so that the guest can obtain some privileged hardware access via special instructions called hypercalls. The advantage is that there are fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, do not support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent hypervisor of that guest at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, ranging from 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware components. The operating system kernel must &lt;br /&gt;
execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3. Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. While the guest virtual machine normally executes in Ring 3, when the guest triggers a trap, the trap is handled on its behalf by the host hypervisor running in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every other hypervisor running on top of it. For instance, assume that L0 (host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, then the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln will act as a trap handler for its guest VM at level Ln+1. This model is not supported by the x86 based systems that are discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86 based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, when L1 attempts to run its own virtual machine L2, this will trigger a trap that goes down to L0. Then L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln will be handled by the host hypervisor at level L0 and then the resulting emulated instruction goes back to Ln.&lt;br /&gt;
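The two hardware models can be summarized in a small sketch (the model names and the function itself are illustrative, not a real architecture interface):&lt;br /&gt;

```python
def handler_level(trap_level, model):
    """Which hypervisor level handles a trap raised by the guest at
    nesting level `trap_level` (level 0 is the host hypervisor)?
    A sketch of the two hardware models described above, not of any
    real architecture's exact behaviour.
    """
    if model == "multiple-level":
        # each hypervisor services its own guest, one level below it
        return trap_level - 1
    if model == "single-level":
        # x86: every trap lands in the host hypervisor at level 0,
        # which then emulates and returns the result upward
        return 0
    raise ValueError("unknown hardware model: " + model)
```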
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that is not compatible with the running OS inside a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example of this is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to let customers host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer gains the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well known example of an IAAS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their API and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a hollow program or network that appears functional to outside users but in reality is only there as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot version of our system as virtual machines and see how the virtual system is attacked or what kinds of features are exploited. We can take advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or even restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines, for example during upgrades or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, because we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed that there was hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between the different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can forward it up to the second-level hypervisor, which can in turn forward it up (or back down), until in the worst case it reaches the hypervisor one level below the trapping virtual machine itself. Because the trap can be bounced between the different levels of hypervisor, a single trap instruction multiplies into many trap instructions. &lt;br /&gt;
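The multiplication of traps described above can be sketched with a toy cost model (the constants and the cost model itself are illustrative, not measurements from the paper):&lt;br /&gt;

```python
def exits_to_handle(depth, privileged_ops=5):
    """Count the hardware exits needed to handle one trap from a guest
    at the given nesting depth (depth 1 means an ordinary guest running
    directly on L0; depth must be a positive integer).

    Toy cost model: the hypervisor just below the trapping guest
    executes `privileged_ops` privileged instructions while handling
    the trap; beyond depth 1 that hypervisor is itself a guest, so each
    of those instructions traps in turn and must be handled one level
    further down.
    """
    if depth == 1:
        return 1  # L0 handles the trap directly in root mode
    # the original exit, plus one handled trap per privileged
    # instruction executed by the guest hypervisor's handler
    return 1 + privileged_ops * exits_to_handle(depth - 1, privileged_ops)
```

With five privileged handler instructions per trap, one trap at depth 2 already costs six exits and one at depth 3 costs thirty-one, which is why containing this growth matters.&lt;br /&gt;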
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software in the guest machines are not practically useful because this support does not always exist; x86 processors, for example, lack it. Solutions that do not require such support suffer significant performance costs because the number of traps expands as the nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on commodity hardware with efficiency. It largely contains the problem of a single nested trap expanding into many more trap instructions, at least for the nesting depths the authors considered, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of a computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to be shared. Putting this together, the paper presents a solution to the problem of efficiently multiplexing the CPU, memory, and I/O between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for both security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs in a VM with a recorded state, which could well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine: only L0 uses the architectural mode for a hypervisor. To make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2. When L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of L1 to handle. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 runs in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 exit (or L3 exit) causes many exits (more exits, less performance). The authors corrected this problem by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end L1 or L0, depending on the trap, finishes handling it and resumes L2. This process repeats continuously. -csulliva&lt;br /&gt;
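The VMCS merge at the heart of this scheme can be sketched as follows; the field names and dictionary layout are hypothetical stand-ins for the real VMCS encodings:&lt;br /&gt;

```python
def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Sketch of how L0 could fold VMCS0->1 and VMCS1->2 into VMCS0->2.

    Fields are illustrative dictionaries, not Intel's real VMCS
    encodings. The key idea: L2 runs with the guest state that L1
    prepared, but every exit must return control to L0, so the host
    state comes from L0's own VMCS; the control bits trap on anything
    either hypervisor wants intercepted.
    """
    return {
        "guest_state": dict(vmcs_1_2["guest_state"]),  # state of L2, set up by L1
        "host_state": dict(vmcs_0_1["host_state"]),    # exits land in L0
        "intercepts": vmcs_0_1["intercepts"] | vmcs_1_2["intercepts"],
    }
```

L2 thus runs with the guest state L1 intended, while every exit still returns control to L0, which decides whether to forward it to L1.&lt;br /&gt;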
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to guest physical) and the EPT (guest physical to host physical). The three translations must therefore be compressed onto the two hardware tables, going from start to end in two hops instead of three. One way is a shadow page table for the virtual machine combined with EPT (shadow-on-EPT), which compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
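The compression of the two EPT levels into one can be sketched as a table composition; real EPTs are multi-level trees with permission bits, but plain page-number dictionaries (an assumption of this sketch) show the idea:&lt;br /&gt;

```python
def compress_ept(ept_1_2, ept_0_1):
    """Fold EPT1->2 (L2-physical page -> L1-physical page) through
    EPT0->1 (L1-physical -> L0-physical) into EPT0->2, so the MMU can
    translate L2 addresses in one hop. Tables are plain page-number
    dicts; an L1 page missing from EPT0->1 would fault to L0 in the
    real system and is simply skipped here."""
    ept_0_2 = {}
    for l2_page, l1_page in ept_1_2.items():
        if l1_page in ept_0_1:
            ept_0_2[l2_page] = ept_0_1[l1_page]
    return ept_0_2
```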
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the guest knows it is talking to a virtual driver (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0 devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0 and L1) needs to use an IOMMU to let its virtual machine safely access the device directly. There is only one platform IOMMU, so L0 needs to emulate an IOMMU for L1. L0 then compresses the multiple IOMMU translation tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device can DMA into L2 memory space directly.&lt;br /&gt;
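The IOMMU compression is the same composition idea as multi-dimensional paging, applied to DMA addresses; again the tables are modeled as hypothetical page-number dictionaries:&lt;br /&gt;

```python
def build_hw_iommu_table(iommu_1_2, ept_0_1):
    """L0 folds the IOMMU mappings L1 programmed for its guest
    (device DMA page -> L1-physical page) through L1's own memory map
    (L1-physical -> L0-physical) into the one table loaded into the
    real hardware IOMMU, so the device DMAs straight into L2 memory.
    Page-number dicts, illustrative only; unmapped pages are skipped."""
    hw_table = {}
    for dma_page, l1_page in iommu_1_2.items():
        if l1_page in ept_0_1:
            hw_table[dma_page] = ept_0_1[l1_page]
    return hw_table
```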
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did they implement the micro-optimizations that speed up the Turtles project? The two main places where the guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only if it has been modified, carefully balancing full copying against partial copying plus tracking. VMCS handling is optimized further by copying multiple VMCS fields at once. Normally, by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time. VMCS data can instead be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
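The difference between per-field and bulk VMCS copying can be sketched by counting trapping instructions; the trap counts are a simplified model, not measurements:&lt;br /&gt;

```python
def copy_vmcs_per_field(src, dst, fields):
    """Baseline: one vmread plus one vmwrite per field. When L1 runs as
    a guest, each such instruction traps to L0; the function returns
    the number of trapping instructions for comparison."""
    traps = 0
    for field in fields:
        dst[field] = src[field]  # stands in for: vmread field; vmwrite field
        traps += 2               # both instructions exit to L0
    return traps

def copy_vmcs_bulk(src, dst, fields):
    """Optimization sketch: treat the VMCS region as ordinary memory and
    copy every field with one large memory copy, so no trapping
    instructions run. As the paper warns, this bypasses the VMX
    specification and may not work on untested processors."""
    dst.update({field: src[field] for field in fields})
    return 0
```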
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper clearly demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers and does not change the end-user experience of applications running on top of this architecture in any clearly detectable way. Nevertheless, the contribution is visible with respect to security and compatibility. Since this is the first efficient implementation of this type that does not modify the hardware (there have been earlier, only partially successful research designs), we expect to see increased interest in the nested virtualization model described above. The framework is convenient for testing and debugging because hypervisors can run underneath other nested hypervisors and VMs without being detected by them. Moreover, the performance overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction. The everlasting performance/flexibility trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits. Furthermore, the paper only performs tests at the L2 level, a guest with two hypervisors below it. It would have been useful for understanding the limits of nesting if the authors had investigated deeper nesting, such as L4 or L5, just to see what the effect is. Another significant limitation concerns optimizations such as the avoidance of vmread/vmwrite operations, which are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate and very precise description of nested virtualization, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge, however, it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example that leans on terms such as EPT and L1. All in all, the accompanying video greatly deepened my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient nested x86 virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. (1973). [http://portal.acm.org/citation.cfm?id=800122.803950 &#039;&#039;Architecture of Virtual Machines&#039;&#039;]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. (2010). &#039;&#039;Nesting Virtual Machines in Virtualization Test Frameworks&#039;&#039;. Master&#039;s Thesis, University of Antwerp.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6621</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6621"/>
		<updated>2010-12-03T01:36:03Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Research problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that sits one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to handle the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system can be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, numbered 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3, while Rings 1 and 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest triggers a trap and the trap is handled by the host hypervisor, the guest hypervisor is given the effect of running in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles the traps of the hypervisors running directly on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, in this case L0, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86 based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, when L1 attempts to run its own virtual machine L2, this will trigger a trap that goes down to L0. Then L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln will be handled by the host hypervisor at level L0 and then the resulting emulated instruction goes back to Ln.&lt;br /&gt;
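The difference between the two hardware-support models can be captured in a toy routing rule. This is an illustrative sketch, not code from the paper; the function names are invented for clarity.&lt;br /&gt;

```python
# Toy model of trap routing under the two hardware-support models.
# Levels: L0 is the host hypervisor; Ln runs on top of L(n-1).

def multilevel_route(trap_level):
    """Multiple-level architecture: the immediate parent handles the trap."""
    return trap_level - 1

def singlelevel_route(trap_level):
    """Single-level architecture (x86): every trap is delivered to L0 first."""
    return 0

# A trap raised by a guest at L3:
assert multilevel_route(3) == 2   # handled by its parent, L2
assert singlelevel_route(3) == 0  # always lands at L0
```

Under the single-level model, L0 may afterwards decide to forward the trap upward in software, which is exactly what the Turtles project implements.&lt;br /&gt;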
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also offer the user a compatibility mode for other operating systems or applications; an example is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to let customers host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to deploy its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a decoy program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can run a honeypot copy of our system as virtual machines and watch how the virtual system is attacked and which features are exploited, taking advantage of the fact that virtual honeypots can easily be controlled, manipulated, destroyed or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, because we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumed hardware support for nested virtualization. Actual implementations, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume that the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. More recently there have been software-based solutions [5], but these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, so does the number of control switches between hypervisor levels. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. Because the trap can bounce between hypervisor levels, a single trap instruction multiplies into many.&lt;br /&gt;
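To see why this multiplication is so damaging, consider a rough cost model. The handler-cost parameter k below is an invented illustration, not a measurement from the paper.&lt;br /&gt;

```python
# Each exit at level n forces the hypervisor at level n-1 to run handler
# code containing roughly k privileged instructions, and since that
# hypervisor itself runs in guest mode, each of those instructions is
# another exit one level down. The constant k is illustrative only.

def exits_at_l0(n, k=10):
    """Total L0-level exits triggered by one exit in a guest at level Ln (n at least 1)."""
    if n == 1:
        return 1  # an L1 exit is handled directly by L0
    return k * exits_at_l0(n - 1, k)

assert exits_at_l0(1) == 1
assert exits_at_l0(2) == 10
assert exits_at_l0(3) == 100  # cost grows exponentially with nesting depth
```

This is the exit-multiplication problem the paper&#039;s optimizations aim to contain.&lt;br /&gt;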
&lt;br /&gt;
Generally, solutions that require architectural support and specialized guest software are not practically useful, because that support does not always exist (for example on x86 processors). Solutions that avoid such requirements suffer significant performance costs from the way the number of traps expands with nesting depth. This paper presents a technique that reconciles the lack of hardware support with efficiency: it largely contains the problem of a single nested trap expanding into many trap instructions, at least for the nesting depths the authors considered, allowing efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to be shared. In sum, the paper presents a solution to the problem of multiplexing the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system with no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The relentless evolution of computing encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, give programmers an infrastructure for further development and ideas. For example, the Accountable Virtual Machines paper wraps programs in a VM with a recorded state, which could well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. Because L1 is itself running as a virtual machine, vmlaunch traps, and L0 must handle the trap: the architecture supports only a single level of virtualization. To multiplex the hardware, L0 makes L2 run as a virtual machine of its own by merging the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to form VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2. When L2 traps, L0 either handles the exit itself or forwards it to L1, depending on whether it falls under L1&#039;s responsibility. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts; these operations would not normally be a problem, but because L1 runs in guest mode, each of them traps, so a single high-level L2 exit causes many exits (and more exits mean less performance). This was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and the process repeats continuously. -csulliva&lt;br /&gt;
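The VMCS merge described above can be sketched abstractly. The split into guest and host state below is a simplification of the real VMX field layout, and the field contents are invented for illustration.&lt;br /&gt;

```python
# Minimal sketch of the VMCS merge step. vmcs_0_1 describes how L0 runs L1;
# vmcs_1_2 describes how L1 wants to run L2. The merged structure lets L0
# run L2 directly while still receiving every exit itself.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Build the merged control structure so that L0 can run L2 directly."""
    return {
        # L2 must see the state L1 prepared for it:
        "guest_state": vmcs_1_2["guest_state"],
        # exits must return control to L0, not L1:
        "host_state": vmcs_0_1["host_state"],
    }

vmcs_0_1 = {"guest_state": "L1 registers", "host_state": "L0 entry point"}
vmcs_1_2 = {"guest_state": "L2 registers", "host_state": "L1 entry point"}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
assert vmcs_0_2 == {"guest_state": "L2 registers",
                    "host_state": "L0 entry point"}
```

The key point is the asymmetry: the guest half comes from L1&#039;s structure so that L2 sees the state L1 intended, while the host half comes from L0 so that every exit returns control to L0.&lt;br /&gt;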
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? With n = 2 nested virtualization there are three logical translations: from L2 virtual to L2 physical addresses, from L2 physical to L1 physical addresses, and from L1 physical to L0 physical addresses. That is three levels of translation, but the hardware MMU exposes only two page tables: the regular page table (virtual to guest physical) and the EPT (guest physical to host physical). The three translations must therefore be compressed onto the two available tables, going from start to end in two hops instead of three. One way is to run a shadow page table for the virtual machine on top of EPT (shadow-on-EPT). The key observation behind multi-dimensional paging is that the EPT tables rarely change while the guest page tables change frequently: L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
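The construction of EPT0-&amp;gt;2 is, at heart, a composition of two translation tables. A minimal sketch, with invented page numbers and tables modelled as dictionaries:&lt;br /&gt;

```python
# Toy illustration of collapsing two logical translations into the single
# EPT table the hardware supports. Keys and values are page numbers.

# maps L2-physical pages to L1-physical pages (maintained by L1):
ept_1_2 = {0: 4, 1: 7}
# maps L1-physical pages to L0 (machine) pages (maintained by L0):
ept_0_1 = {4: 20, 7: 35}

def build_ept_0_2(ept_1_2, ept_0_1):
    """Compose the two tables so L2-physical pages map straight to machine pages."""
    return {l2: ept_0_1[l1] for l2, l1 in ept_1_2.items()}

assert build_ept_0_2(ept_1_2, ept_0_1) == {0: 20, 1: 35}
```

With the composed table installed in hardware, an L2 memory access needs no intervention from L1 at all, which is where the exit savings come from.&lt;br /&gt;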
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman 01), para-virtualized drivers (Barham 03, Russell 08) and direct device assignment (LeVasseur 04, Yassour 08), which gives the best performance. To get the best performance, the authors use an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, the authors chose multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. This requires handling memory-mapped I/O, programmed I/O, DMA and interrupts. For DMA, each hypervisor (L0 and L1) needs an IOMMU to let its virtual machines access the device safely, but there is only one platform IOMMU, so L0 emulates an IOMMU for L1. L0 then compresses the multiple IOMMU translations into the single hardware IOMMU page table, so that L2 programs the device directly and the device DMAs into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did they implement the micro-optimizations to make the Turtles project faster? The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2, and the exit-handling code running on the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. The VMCS handling is optimized further by copying multiple fields at once: by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field, but VMCS data can in practice be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones they tested). The main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
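The bulk-copy optimization can be illustrated with a toy bookkeeping model. The field names and exit counts are invented, and real VMCS access of course involves trapping hardware instructions, not Python dictionaries.&lt;br /&gt;

```python
# Sketch of the vmread/vmwrite optimization: instead of one trapping
# instruction per VMCS field, copy many fields in one bulk memory copy.
# The exit counts are illustrative bookkeeping, not measurements.

def copy_per_field(fields):
    """One vmread plus one vmwrite per field; in guest mode, each one traps."""
    exits = 2 * len(fields)
    return dict(fields), exits

def copy_bulk(fields):
    """Copy all fields with ordinary loads and stores: no traps at all."""
    return dict(fields), 0

fields = {"rip": 0x1000, "rsp": 0x2000, "cr3": 0x3000}
_, slow_exits = copy_per_field(fields)
_, fast_exits = copy_bulk(fields)
assert slow_exits == 6 and fast_exits == 0
```

The trade-off, noted in the paper itself, is that the bulk copy steps outside the VMX specification and is only known to work on the processors the authors tested.&lt;br /&gt;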
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper makes a strong contribution to virtualization and resource sharing within a single machine. It is aimed at systems programmers: applications running on top of this architecture behave with no deviation the end user would notice, yet the contribution is clearly visible with respect to security and compatibility. Since this is the first successful implementation of its type that does not modify the hardware (earlier attempts were research prototypes at best), we expect to see increased interest in the nested virtualization model described above. The framework also makes for convenient testing and debugging, because a hypervisor can run beneath other nested hypervisors and VMs without being detected. Moreover, thanks to optimizations such as omitted vmwrites and multi-dimensional paging, the performance overhead is reduced to roughly 6-10% per level, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is performance: the additional level of abstraction the authors introduce costs efficiency, and the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplication of exits across levels. Furthermore, we observed that the paper only benchmarks at the L2 level, a guest with two hypervisors below it. Investigating deeper nesting, such as L4 or L5, would have been useful to understand the limits of nesting. Another significant limitation concerns optimizations such as avoiding vmread/vmwrite operations, which are tied to specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate and very precise description of nested virtualization, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge, however, it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example that leans on terms such as EPT and L1. All in all, the accompanying video greatly deepened my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient nested x86 virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. (1973). [http://portal.acm.org/citation.cfm?id=800122.803950 &#039;&#039;Architecture of Virtual Machines&#039;&#039;]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. (2010). &#039;&#039;Nesting Virtual Machines in Virtualization Test Frameworks&#039;&#039;. Master&#039;s Thesis, University of Antwerp.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6620</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6620"/>
		<updated>2010-12-03T01:33:18Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* References */ editing reference&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that sits one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to handle the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|400px|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead it relies on a software interface, implemented in the guest kernel, that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus better efficiency. However, portability is an obvious issue, since a system can be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
The trap-and-emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, ranging from 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must&lt;br /&gt;
execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3. Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest triggers a trap, the trap is handled by the host hypervisor, so the guest&#039;s privileged operations are effectively carried out in Ring 0 on its behalf.&lt;br /&gt;
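On x86 the current ring is encoded in the low two bits of the CS segment selector. A minimal sketch, decoding the ring from a selector value (the example selectors are the ones Linux uses on x86-64 for kernel and user code; other systems may use different values):&lt;br /&gt;

```python
# The current privilege level (CPL) lives in the low two bits of the
# CS segment selector on x86. Decode the ring from a selector value.
# Example selectors are those Linux uses on x86-64: kernel code 0x10,
# user code 0x33; other OSes may lay out their GDT differently.

def ring(cs_selector: int) -> int:
    return cs_selector & 0x3   # bits 0-1 hold the privilege level

print(ring(0x10))  # kernel code segment -> 0
print(ring(0x33))  # user code segment   -> 3
```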
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every hypervisor running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute a privileged instruction as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, this triggers a trap that goes down to L0. L0 then sends the result of the requested instruction back to L1. In general, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
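The routing difference between the two models can be sketched as a toy simulation (illustrative only; the function names are invented). In the multiple-level model a trap at level Ln goes straight to its parent Ln-1; in the single-level model the hardware always delivers it to L0, which forwards it in software:&lt;br /&gt;

```python
# Toy comparison of the two hardware-support models (not real
# hypervisor code). Levels are integers: 0 is the host hypervisor,
# n is the guest n levels up.

def multi_level_handler(trap_level: int) -> int:
    # Multiple-level model: the direct parent handles its own child.
    return trap_level - 1

def single_level_route(trap_level: int):
    # Single-level model (x86): hardware delivers every trap to L0;
    # L0 then forwards it in software to the logically responsible
    # hypervisor, i.e. the trapping VM's parent.
    path = [0]                      # hardware lands at L0 first
    responsible = trap_level - 1
    if responsible > 0:
        path.append(responsible)    # forwarded up in software
    return path

print(multi_level_handler(2))   # L1 handles its guest L2 directly
print(single_level_route(2))    # on x86: L0 first, then forwarded to L1
```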
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to let customers host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and websites host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a hollow program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and see how the virtual system is attacked and which features are exploited. We can take advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines for upgrades or disaster&lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to new server hardware for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a highly nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions.&lt;br /&gt;
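The multiplication can be made concrete with a rough recursive cost model (the numbers are invented for illustration, not measurements from the paper): handling one exit of Ln requires Ln-1 to execute some privileged instructions, and if Ln-1 is itself a guest, each of those traps again.&lt;br /&gt;

```python
# Rough model of why traps multiply with nesting depth. The constant
# below is an assumed, illustrative number of privileged instructions
# a hypervisor's exit handler executes; it is not from the paper.

PRIV_OPS_PER_EXIT = 4

def exits(level: int) -> int:
    """Total hardware exits caused by one trap in the level-`level` guest."""
    if level <= 1:
        return 1                      # L0 handles an L1 trap directly
    # one exit for the trap itself, plus the parent handler's own
    # privileged operations, each of which traps recursively
    return 1 + PRIV_OPS_PER_EXIT * exits(level - 1)

for n in (1, 2, 3):
    print(n, exits(n))   # 1, 5, 21 -> grows geometrically with depth
```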
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, as on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency: it solves the problem of a single nested trap expanding into many more trap instructions, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to share. Taken together, the paper presents a solution to the problem of how to multiplex the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and harmonious with cloud computing. The paper contributes to this trend by allowing consumers and users to provision machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs in a VM with a recorded state, and such a VM could well be placed on a separate hypervisor for stronger isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest hypervisor) runs L1 with VMCS0-&amp;gt;1 (Virtual Machine Control Structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is passed along to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. That vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and only L0 controls the hardware&#039;s virtualization mode. To multiplex the hardware, L0 makes L2 run as a virtual machine of L1: L0 merges the VMCSs, combining VMCS0-&amp;gt;1 with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 causes an exit, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of L1&#039;s virtual machine. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which wouldn&#039;t normally be a problem; but because L1 is running in guest mode, all of these operations trap, so a single high-level L2 (or L3) exit causes many exits, and more exits mean less performance. This problem was corrected by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
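The merge step above can be sketched with dictionaries standing in for VMCSs (a minimal sketch: the field names are made up for illustration, since real VMCS fields are defined by Intel&#039;s VMX specification). L0 combines the controls it imposes on L1 with the state L1 prepared for L2:&lt;br /&gt;

```python
# Minimal sketch of the VMCS merge: L0 combines the fields it imposes
# on L1 (VMCS0->1) with the fields L1 prepared for L2 (VMCS1->2) to
# build VMCS0->2, letting the hardware run L2 directly on L1's behalf.
# Field names here are illustrative, not real VMX field encodings.

vmcs01 = {"host_rip": "l0_exit_handler", "ept_pointer": "EPT0->1"}
vmcs12 = {"guest_rip": "l2_entry", "guest_cr3": "l2_page_table",
          "ept_pointer": "EPT1->2"}

def merge(vmcs01, vmcs12):
    merged = dict(vmcs12)                    # start from L1's view of L2
    merged["host_rip"] = vmcs01["host_rip"]  # exits must still land in L0
    merged["ept_pointer"] = "EPT0->2"        # compressed translation table
    return merged

vmcs02 = merge(vmcs01, vmcs12)
print(vmcs02["host_rip"], vmcs02["guest_rip"])
```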
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The three translations must therefore be compressed onto the two available tables, going from start to end in two hops instead of three. One way is a shadow page table for the virtual machine combined with shadow-on-EPT, which compresses the three logical translations into two tables. The key observation is that the EPT tables rarely change, whereas the guest page tables change frequently; so L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
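The compression of two translation tables into one is just function composition. A minimal sketch, modeling each table as a dict from guest-physical page number to the next level&#039;s physical page number (the page numbers are arbitrary illustrative values):&lt;br /&gt;

```python
# Sketch of compressing two translation tables into one, as multi-
# dimensional paging does when constructing EPT0->2 from EPT0->1 and
# EPT1->2. Tables are dicts: page number at one level -> page number
# at the next level down. Values are arbitrary illustrative numbers.

ept12 = {0: 7, 1: 3}        # L2 physical page -> L1 physical page
ept01 = {7: 42, 3: 99}      # L1 physical page -> L0 physical page

def compose(outer, inner):
    """Build the direct table: L2 physical -> L0 physical in one hop."""
    return {page: outer[mid] for page, mid in inner.items()}

ept02 = compose(ept01, ept12)
print(ept02)   # {0: 42, 1: 99}
```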
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the driver knows it is running on a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they chose multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. This requires supporting memory-mapped I/O, programmed I/O, DMA, and interrupts. For DMA, each hypervisor (L0 and L1) needs an IOMMU to let its virtual machine access the device safely. There is only one platform IOMMU, so L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU translations into the single hardware IOMMU page table, so that L2 can program the device directly and the device can DMA into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did they implement the micro-optimizations that make the Turtles project faster? The two main places where the guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. First, they optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0, most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying with tracking. The VMCS handling is optimized further by copying multiple VMCS fields at once: by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field, but VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). Second, the main cause of slow exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read or written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at programmers and does not affect the end user: applications run on top of this architecture with no clearly detectable deviation. Nevertheless, the contribution is visible with respect to security and compatibility. Since this is the first successful implementation of its type that does not modify the hardware (there have been half-decent research designs), we expect to see increased interest in the nested virtualization model described above. The framework makes for convenient testing and debugging, because hypervisors can function inconspicuously beneath other nested hypervisors and VMs without being detected. Moreover, the efficiency overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which sounds very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction. The everlasting memory/efficiency dispute continues as nested virtualization enters our lives. The performance hit is mainly imposed by the exponentially generated exits. Furthermore, we observed that the paper performs tests only at the L2 level, a guest with two hypervisors below it. It might have been useful, to understand the limits of nesting, to investigate higher levels such as L4 or L5, just to see the effect. Another significant detriment relates to optimizations such as avoiding vmread/vmwrite operations, which are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before fully delving into the theory of the design. For instance, paragraph 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, attempts to illustrate the technique with an example using terms such as EPT and L1. All in all, the provided video greatly deepened my understanding of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper Award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. [http://portal.acm.org/citation.cfm?id=800122.803950 Architecture of Virtual Machines]. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6619</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6619"/>
		<updated>2010-12-03T01:30:42Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* References */ adding references&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to the discussion page for group member confirmation, general talk, and paper discussion.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some insight into and background for the concepts&lt;br /&gt;
and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program, or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run the virtual machines L1, L2 and L3. In turn, each of those virtual machines is able to run its own virtual machines, and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|400px]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead, it relies on a software interface implemented in the guest kernel that grants some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Note also that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
====Trap and emulate model====&lt;br /&gt;
The trap and emulate model is based on the idea that when a guest hypervisor attempts to execute privileged instructions or access privileged hardware components, it triggers a trap or fault that is caught and handled by the host hypervisor. Depending on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, ranging from 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware. The operating system kernel must&lt;br /&gt;
execute in Ring 0 in order to access the hardware and retain control. User programs execute in Ring 3. Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. The guest virtual machine normally executes in Ring 3; when the guest triggers a trap, the trap is handled by the host hypervisor, so the guest&#039;s privileged operations are effectively carried out in Ring 0 on its behalf.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every hypervisor running on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, handles the trap. If L1 runs L2, and L2 attempts to execute a privileged instruction as well, then L1 acts as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86-based systems. In this model, everything must go back to the main host hypervisor at level L0. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, this triggers a trap that goes down to L0. L0 then sends the result of the requested instruction back to L1. In general, a trap at level Ln is handled by the host hypervisor at level L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to let customers host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and websites host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is essentially a hollow program or network that appears functional to outside users but in reality exists only as a security tool to observe or trap attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and see how the virtual system is attacked and which features are exploited. We can take advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines for upgrades or disaster&lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to new server hardware for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors into one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s [4]. Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a highly nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions.&lt;br /&gt;
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, as on x86 processors. Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency: it solves the problem of a single nested trap expanding into many more trap instructions, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources to share. Taken together, the paper presents a solution to the problem of how to multiplex the CPU, memory, and I/O efficiently between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and harmonious with cloud computing. The paper contributes to this trend by allowing consumers and users to provision machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further ideas on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs in a VM with a recorded state, and such a VM could well be placed on a separate hypervisor for stronger isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and only L0 occupies the architectural hypervisor mode. To let L2 run as a virtual machine of L1 anyway, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the L1 hypervisor&#039;s responsibility. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 runs in guest mode, each of these operations traps, so a single high-level L2 exit causes many low-level exits (and more exits means less performance). The authors addressed this problem by making each single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end L1 or L0, depending on the trap, finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
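The merge described above can be sketched as a toy model in Python. Plain dictionaries stand in for real VMCS structures; the field names and merge rules are invented for illustration and are not the paper&#039;s actual algorithm:&lt;br /&gt;

```python
# Toy model of the Turtles VMCS merge. L0 combines the control data of
# VMCS 0-to-1 (how L0 runs L1) with the guest state of VMCS 1-to-2 (how
# L1 wants to run L2) into VMCS 0-to-2, letting L0 run L2 directly.
# A real VMCS has hundreds of fields; these names are invented.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    merged = {}
    # Guest state comes from VMCS 1-to-2: it describes L2 itself.
    merged["guest_state"] = dict(vmcs_1_2["guest_state"])
    # Host state comes from VMCS 0-to-1: every exit must land in L0,
    # the only hypervisor really in control of the hardware.
    merged["host_state"] = dict(vmcs_0_1["host_state"])
    # Intercepts are unioned: an event trapped by either L0 or L1
    # must cause an exit so the right hypervisor can see it.
    merged["exit_controls"] = sorted(set(vmcs_0_1["exit_controls"]) |
                                     set(vmcs_1_2["exit_controls"]))
    return merged

vmcs_0_1 = {"guest_state": {"rip": "L1_entry"},
            "host_state": {"rip": "L0_exit_handler"},
            "exit_controls": ["ept_violation", "external_interrupt"]}
vmcs_1_2 = {"guest_state": {"rip": "L2_entry"},
            "host_state": {"rip": "L1_exit_handler"},
            "exit_controls": ["io_access"]}

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest_state"]["rip"])  # the CPU runs L2 directly
print(vmcs_0_2["host_state"]["rip"])   # but every exit lands in L0
```
&lt;br /&gt;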
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations, from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The authors compress the three translations onto these two tables, going from start to end in two hops instead of three. This can be done with a shadow page table for the virtual machine or with shadow-on-EPT, which compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
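The composition of the two EPT levels into one can be sketched as a toy model. Flat dictionaries stand in for real multi-level page tables, and all addresses are invented for illustration:&lt;br /&gt;

```python
# Toy model of multi-dimensional paging: L0 knows EPT 0-to-1 (L1
# physical to L0 physical) and observes EPT 1-to-2 (L2 physical to L1
# physical), and composes them into EPT 0-to-2 so the hardware can
# translate an L2 guest-physical address to an L0 host-physical
# address in a single step.

def compose_ept(ept_1_2, ept_0_1):
    ept_0_2 = {}
    for l2_phys, l1_phys in ept_1_2.items():
        # Only addresses mapped at both levels get a combined entry;
        # anything else faults, just as an unmapped page would.
        if l1_phys in ept_0_1:
            ept_0_2[l2_phys] = ept_0_1[l1_phys]
    return ept_0_2

ept_0_1 = {0x1000: 0x9000, 0x2000: 0xA000}   # L1 phys to L0 phys
ept_1_2 = {0x4000: 0x1000, 0x5000: 0x2000}   # L2 phys to L1 phys

ept_0_2 = compose_ept(ept_1_2, ept_0_1)
print(hex(ept_0_2[0x4000]))  # two hops collapsed into one lookup
```
&lt;br /&gt;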
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, which the guest knows about (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization, but the authors chose multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. The idea with DMA is that each hypervisor, L0 and L1, needs to use an IOMMU to let its virtual machine access the device safely. There is only one hardware IOMMU, so L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU translations into the single hardware IOMMU page table, so that L2 can program the device directly and the device DMAs into L2&#039;s memory space directly.&lt;br /&gt;
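The compression of the per-level IOMMU mappings into the single hardware table can be sketched as a toy model. Flat dictionaries stand in for real IOMMU page tables; all addresses are invented for illustration:&lt;br /&gt;

```python
# Toy model of multi-level device assignment: L0 folds the IOMMU
# mapping it gives L1 and the mapping L1 gives L2 into the single
# hardware IOMMU table, so a device assigned to L2 can DMA straight
# into L2's memory while unmapped accesses are still blocked.

def build_hw_iommu(iommu_0_1, iommu_1_2):
    # Same composition trick as multi-dimensional paging, applied to
    # device DMA addresses instead of CPU page tables.
    hw = {}
    for dma_addr, l1_phys in iommu_1_2.items():
        if l1_phys in iommu_0_1:
            hw[dma_addr] = iommu_0_1[l1_phys]
    return hw

def device_dma(hw_iommu, dma_addr):
    # The hardware IOMMU either translates the device's address or
    # blocks the access: this is the "safe" in safe DMA bypass.
    if dma_addr in hw_iommu:
        return ("ok", hw_iommu[dma_addr])
    return ("blocked", None)

iommu_0_1 = {0x7000: 0xF000}   # L1 phys to L0 phys
iommu_1_2 = {0x3000: 0x7000}   # L2 DMA addr to L1 phys
hw = build_hw_iommu(iommu_0_1, iommu_1_2)
print(device_dma(hw, 0x3000))  # translated: lands in L2's memory
print(device_dma(hw, 0x8000))  # unmapped: the access is blocked
```
&lt;br /&gt;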
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did the authors implement the micro-optimizations that make the Turtles project faster? The two main places where the guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling is optimized further by copying multiple VMCS fields at once: by Intel&#039;s specification, VMCS reads and writes must be performed with the vmread and vmwrite instructions, which operate on a single field at a time, but on the tested processors VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to read and change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM, by contrast, the guest and host specifications can be read and written directly with ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
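The trade-off between per-field vmread/vmwrite copies and one large memory copy can be sketched with a toy cost model. The field list and trap counts are invented for illustration; this is not the paper&#039;s code:&lt;br /&gt;

```python
# Toy cost model of the vmread/vmwrite optimization: when L1 handles
# one L2 exit it touches many VMCS fields, and on Intel VMX each
# vmread or vmwrite executed by L1 traps to L0. Copying the fields
# with one large memory copy avoids those traps entirely.

FIELDS = ["guest_rip", "guest_rsp", "exit_reason", "exit_qualification"]

def per_field_copy(src, dst):
    traps = 0
    for f in FIELDS:
        dst[f] = src[f]   # modeled as one vmread plus one vmwrite...
        traps += 2        # ...and each instruction traps to L0
    return traps

def bulk_copy(src, dst):
    dst.update({f: src[f] for f in FIELDS})  # one big memory copy
    return 0                                 # no traps at all

src = {f: i for i, f in enumerate(FIELDS)}
print(per_field_copy(src, {}))  # trap count for four fields
print(bulk_copy(src, {}))       # zero traps with the bulk copy
```
&lt;br /&gt;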
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers and does not affect the end user in any clearly detectable way when applications run on top of this architecture. Nevertheless, its contribution is visible with respect to security and compatibility. Since this is the first successful implementation of its type that does not modify the hardware (earlier research designs only came part of the way), we expect to see increased interest in the nested virtualization model described above. The framework also makes for convenient testing and debugging, because hypervisors can function inconspicuously alongside other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction; the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplied exits. Furthermore, the paper performs its tests only at the L2 level, a guest with two hypervisors below it. To understand the limits of nesting, it might have been useful to investigate higher levels such as L4 or L5, just to see the effect. Another significant detriment is that optimizations such as avoiding vmread/vmwrite operations are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, paragraph 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, illustrates the technique with an example that relies on terms such as EPT and L1. All in all, the provided video greatly increased my in-depth awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal requirements for virtualizable 3rd Generation architecture, section 1: Virtual machine concepts&#039;&#039; ]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;br /&gt;
&lt;br /&gt;
[4] Goldberg, R. P. Architecture of Virtual Machines. In &#039;&#039;Proceedings of the Workshop on Virtual Computer Systems&#039;&#039;, ACM, pp. 74-112.&lt;br /&gt;
&lt;br /&gt;
[5] Berghmans, O. Nesting Virtual Machines in Virtualization Test Frameworks. Master&#039;s Thesis, University of Antwerp, 2010.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6618</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6618"/>
		<updated>2010-12-03T01:26:43Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Research problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it is essential that we provide some insight and background into the concepts &lt;br /&gt;
and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, we&#039;re actually running the virtual machine as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on full-virtualization of hardware within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), a hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources. [2]&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside another virtual machine. For instance, the main operating system hypervisor (L0) can run a virtual machine whose OS is itself a hypervisor (L1); in turn, that hypervisor can run its own virtual machines (L2), and so on (Figure 1). &lt;br /&gt;
[[File:virtualization2.png|thumb|right|Figure 1: Nested virtualization. The guest hypervisor denotes the creation of a virtual machine.|left|400px]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; it instead relies on a software interface, implemented in the guest kernel, through which the guest gets some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus more efficiency. However, portability is an obvious issue, since a system can be para-virtualized to be compatible with only one hypervisor. Another thing to note is that some operating systems, such as Windows, don&#039;t support para-virtualization. [3]&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
====Trap and emulate model====&lt;br /&gt;
The trap and emulate model is based on the idea that when a guest hypervisor attempts to execute higher-level instructions or access privileged hardware components, it triggers a trap or a fault that gets caught by the host hypervisor. Based on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems there are four levels of access privilege, called rings, that range from 0 to 3.&lt;br /&gt;
Ring 0 is the most privileged level, allowing access to the bare hardware components. The operating system kernel must &lt;br /&gt;
execute in Ring 0 in order to access the hardware and maintain control. User programs execute in Ring 3. Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes in Ring 0. While the guest virtual machine would normally execute in Ring 3, when the guest triggers a trap that is handled by the host hypervisor, the guest hypervisor&#039;s privileged operations are effectively carried out in Ring 0.&lt;br /&gt;
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
Every parent hypervisor handles every other hypervisor running on top of it. For instance, assume that L0 (host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, then the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln will act as a trap handler for its guest VM at level Ln+1. This model is not supported by the x86 based systems that are discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
The model supported by x86 based systems. In this model, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, when L1 attempts to run its own virtual machine L2, this will trigger a trap that goes down to L0. Then L0 sends the result of the requested instruction back to L1. Generally, a trap at level Ln will be handled by the host hypervisor at level L0 and then the resulting emulated instruction goes back to Ln.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also offer the user a compatibility mode for other operating systems or applications; an example is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well known example of an IAAS provider is Amazon Web Services (AWS). AWS presents a virtualized platform for other services and web sites to host their API and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is basically a hollow program or network that appears to be functioning to outside users but in reality is only there as a security tool to watch or trap hacker attacks. Using nested virtualization, we can create a honeypot of our system as virtual machines and see how our virtual system is attacked or what kinds of features are exploited. We can take advantage of the fact that those virtual honeypots can easily be controlled, manipulated, destroyed or even restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines in cases of upgrade or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid 1970s [4]. Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions [5]; however, these solutions suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions. &lt;br /&gt;
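This multiplication of traps can be sketched with a toy cost model. The per-handler count of privileged instructions is invented purely to show the blow-up:&lt;br /&gt;

```python
# Toy cost model of why nesting is expensive without the techniques in
# this paper: handling one exit at level n requires the hypervisor at
# level n-1 to run, and each of its privileged instructions is itself
# an exit for the level below it. The per-handler instruction count is
# invented purely to illustrate the multiplicative growth.

PRIVILEGED_OPS_PER_EXIT = 10  # privileged instructions in one handler

def exits_caused_by(level):
    # An exit at L1 is handled directly by L0 on bare metal: one exit.
    if level == 1:
        return 1
    # An exit at a deeper level makes the parent run its handler, and
    # every privileged instruction in that handler exits one level down.
    return PRIVILEGED_OPS_PER_EXIT * exits_caused_by(level - 1)

for n in [1, 2, 3]:
    print(f"one exit at L{n} expands to {exits_caused_by(n)} L0-level exits")
```
&lt;br /&gt;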
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist; x86 processors, for example, lack it. Solutions that do not require such support suffer significant performance costs, because the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency: it prevents a single nested trap from expanding into many more trap instructions, allowing efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of a computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources that must be shared. Putting this together, the paper presents a solution to the problem of multiplexing the CPU, memory, and I/O efficiently between multiple guest operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The continuing evolution of computers encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides a basis for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, give programmers a foundation for further development and ideas built on this infrastructure. For example, the Accountable Virtual Machines paper wraps programs around a VM in a particular state, and such a VM could well be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
How does nested VMX virtualization work in the Turtles project? L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the machine is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 must handle the trap, because L1 is itself running as a virtual machine and only L0 occupies the architectural hypervisor mode. To let L2 run as a virtual machine of L1 anyway, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the L1 hypervisor&#039;s responsibility. To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem; but because L1 runs in guest mode, each of these operations traps, so a single high-level L2 exit causes many low-level exits (and more exits means less performance). The authors addressed this problem by making each single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end L1 or L0, depending on the trap, finishes handling it and resumes L2, and this process repeats continuously. -csulliva&lt;br /&gt;
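The merge described above can be sketched as a toy model in Python. Plain dictionaries stand in for real VMCS structures; the field names and merge rules are invented for illustration and are not the paper&#039;s actual algorithm:&lt;br /&gt;

```python
# Toy model of the Turtles VMCS merge. L0 combines the control data of
# VMCS 0-to-1 (how L0 runs L1) with the guest state of VMCS 1-to-2 (how
# L1 wants to run L2) into VMCS 0-to-2, letting L0 run L2 directly.
# A real VMCS has hundreds of fields; these names are invented.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    merged = {}
    # Guest state comes from VMCS 1-to-2: it describes L2 itself.
    merged["guest_state"] = dict(vmcs_1_2["guest_state"])
    # Host state comes from VMCS 0-to-1: every exit must land in L0,
    # the only hypervisor really in control of the hardware.
    merged["host_state"] = dict(vmcs_0_1["host_state"])
    # Intercepts are unioned: an event trapped by either L0 or L1
    # must cause an exit so the right hypervisor can see it.
    merged["exit_controls"] = sorted(set(vmcs_0_1["exit_controls"]) |
                                     set(vmcs_1_2["exit_controls"]))
    return merged

vmcs_0_1 = {"guest_state": {"rip": "L1_entry"},
            "host_state": {"rip": "L0_exit_handler"},
            "exit_controls": ["ept_violation", "external_interrupt"]}
vmcs_1_2 = {"guest_state": {"rip": "L2_entry"},
            "host_state": {"rip": "L1_exit_handler"},
            "exit_controls": ["io_access"]}

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest_state"]["rip"])  # the CPU runs L2 directly
print(vmcs_0_2["host_state"]["rip"])   # but every exit lands in L0
```
&lt;br /&gt;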
&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
&lt;br /&gt;
How does multi-dimensional paging work in the Turtles project? The main idea: with n = 2 nested virtualization there are three logical translations, from an L2 virtual to an L2 physical address, from an L2 physical to an L1 physical address, and from an L1 physical to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to physical) and the EPT (guest physical to host physical). The authors compress the three translations onto these two tables, going from start to end in two hops instead of three. This can be done with a shadow page table for the virtual machine or with shadow-on-EPT, which compresses the three logical translations into two tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
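The composition of the two EPT levels into one can be sketched as a toy model. Flat dictionaries stand in for real multi-level page tables, and all addresses are invented for illustration:&lt;br /&gt;

```python
# Toy model of multi-dimensional paging: L0 knows EPT 0-to-1 (L1
# physical to L0 physical) and observes EPT 1-to-2 (L2 physical to L1
# physical), and composes them into EPT 0-to-2 so the hardware can
# translate an L2 guest-physical address to an L0 host-physical
# address in a single step.

def compose_ept(ept_1_2, ept_0_1):
    ept_0_2 = {}
    for l2_phys, l1_phys in ept_1_2.items():
        # Only addresses mapped at both levels get a combined entry;
        # anything else faults, just as an unmapped page would.
        if l1_phys in ept_0_1:
            ept_0_2[l2_phys] = ept_0_1[l1_phys]
    return ept_0_2

ept_0_1 = {0x1000: 0x9000, 0x2000: 0xA000}   # L1 phys to L0 phys
ept_1_2 = {0x4000: 0x1000, 0x5000: 0x2000}   # L2 phys to L1 phys

ept_0_2 = compose_ept(ept_1_2, ept_0_1)
print(hex(ept_0_2[0x4000]))  # two hops collapsed into one lookup
```
&lt;br /&gt;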
&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
&lt;br /&gt;
How does I/O virtualization work in the Turtles project? There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, which the guest knows about (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization, but the authors chose multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA, and interrupts. The idea with DMA is that each hypervisor, L0 and L1, needs to use an IOMMU to let its virtual machine access the device safely. There is only one hardware IOMMU, so L0 emulates an IOMMU for L1 and then compresses the multiple IOMMU translations into the single hardware IOMMU page table, so that L2 can program the device directly and the device DMAs into L2&#039;s memory space directly.&lt;br /&gt;
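The compression of the per-level IOMMU mappings into the single hardware table can be sketched as a toy model. Flat dictionaries stand in for real IOMMU page tables; all addresses are invented for illustration:&lt;br /&gt;

```python
# Toy model of multi-level device assignment: L0 folds the IOMMU
# mapping it gives L1 and the mapping L1 gives L2 into the single
# hardware IOMMU table, so a device assigned to L2 can DMA straight
# into L2's memory while unmapped accesses are still blocked.

def build_hw_iommu(iommu_0_1, iommu_1_2):
    # Same composition trick as multi-dimensional paging, applied to
    # device DMA addresses instead of CPU page tables.
    hw = {}
    for dma_addr, l1_phys in iommu_1_2.items():
        if l1_phys in iommu_0_1:
            hw[dma_addr] = iommu_0_1[l1_phys]
    return hw

def device_dma(hw_iommu, dma_addr):
    # The hardware IOMMU either translates the device's address or
    # blocks the access: this is the "safe" in safe DMA bypass.
    if dma_addr in hw_iommu:
        return ("ok", hw_iommu[dma_addr])
    return ("blocked", None)

iommu_0_1 = {0x7000: 0xF000}   # L1 phys to L0 phys
iommu_1_2 = {0x3000: 0x7000}   # L2 DMA addr to L1 phys
hw = build_hw_iommu(iommu_0_1, iommu_1_2)
print(device_dma(hw, 0x3000))  # translated: lands in L2's memory
print(device_dma(hw, 0x8000))  # unmapped: the access is blocked
```
&lt;br /&gt;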
&lt;br /&gt;
==Micro optimizations==&lt;br /&gt;
How did the authors implement the micro-optimizations that make the Turtles project faster? The two main places where the guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. The authors optimized the transitions between L1 and L2, each of which involves an exit to L0 and then an entry. In L0 most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling is optimized further by copying multiple VMCS fields at once: by Intel&#039;s specification, VMCS reads and writes must be performed with the vmread and vmwrite instructions, which operate on a single field at a time, but on the tested processors VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of slowdown in exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to read and change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM, by contrast, the guest and host specifications can be read and written directly with ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specifications.&lt;br /&gt;
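The trade-off between per-field vmread/vmwrite copies and one large memory copy can be sketched with a toy cost model. The field list and trap counts are invented for illustration; this is not the paper&#039;s code:&lt;br /&gt;

```python
# Toy cost model of the vmread/vmwrite optimization: when L1 handles
# one L2 exit it touches many VMCS fields, and on Intel VMX each
# vmread or vmwrite executed by L1 traps to L0. Copying the fields
# with one large memory copy avoids those traps entirely.

FIELDS = ["guest_rip", "guest_rsp", "exit_reason", "exit_qualification"]

def per_field_copy(src, dst):
    traps = 0
    for f in FIELDS:
        dst[f] = src[f]   # modeled as one vmread plus one vmwrite...
        traps += 2        # ...and each instruction traps to L0
    return traps

def bulk_copy(src, dst):
    dst.update({f: src[f] for f in FIELDS})  # one big memory copy
    return 0                                 # no traps at all

src = {f: i for i, f in enumerate(FIELDS)}
print(per_field_copy(src, {}))  # trap count for four fields
print(bulk_copy(src, {}))       # zero traps with the bulk copy
```
&lt;br /&gt;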
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and resource sharing within a single machine. It is aimed at systems programmers and does not affect the end user in any clearly detectable way when applications run on top of this architecture. Nevertheless, its contribution is visible with respect to security and compatibility. Since this is the first successful implementation of its type that does not modify the hardware (earlier research designs only came part of the way), we expect to see increased interest in the nested virtualization model described above. The framework also makes for convenient testing and debugging, because hypervisors can function inconspicuously alongside other nested hypervisors and VMs without being detected. Moreover, the overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and multi-dimensional paging, which is very appealing.&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
The main drawback is efficiency, which suffers as the authors introduce an additional level of abstraction; the everlasting memory/efficiency trade-off continues as nested virtualization enters our lives. The performance hit is mainly imposed by the multiplied exits. Furthermore, the paper performs its tests only at the L2 level, a guest with two hypervisors below it. To understand the limits of nesting, it might have been useful to investigate higher levels such as L4 or L5, just to see the effect. Another significant detriment is that optimizations such as avoiding vmread/vmwrite operations are aimed at specific CPUs, as stated on page 7, section 3.5: &amp;quot;(...) this optimization does not strictly adhere to the VMX specifications, and thus might not work on processors other than the ones we have tested&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, paragraph 4.1.2, &amp;quot;Impact of Multidimensional paging&amp;quot;, illustrates the technique with an example that relies on terms such as EPT and L1. All in all, the provided video greatly increased my in-depth awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;br /&gt;
&lt;br /&gt;
[2] Popek &amp;amp; Goldberg (1974).  [http://www.google.ca/url?sa=t&amp;amp;source=web&amp;amp;cd=3&amp;amp;ved=0CCkQFjAC&amp;amp;url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.141.4815%26rep%3Drep1%26type%3Dpdf&amp;amp;ei=uxD4TL_OOYeSswbbydzZCA&amp;amp;usg=AFQjCNEavbxNIe4sUwidBvE_3S8MXY3fHg&amp;amp;sig2=BS1tG9eadLRrKVItvb6gBg &#039;&#039;Formal Requirements for Virtualizable Third Generation Architectures, section 1: Virtual machine concepts&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
[3] Tanenbaum, Andrew (2007). &#039;&#039;Modern Operating Systems (3rd edition)&#039;&#039;, page 574-576.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6579</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6579"/>
		<updated>2010-12-02T23:55:50Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* General discussion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please,&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something? We haven&#039;t confirmed or emailed the prof yet; I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name or even email me; you can find my contact info on my profile page (just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtle Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks pretty good to me. Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not let it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man? I&#039;d love to see Halford though; I&#039;m sure he&#039;ll do some classic Priest material. I haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done by the 24th at the latest; we should leave the last day for editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add but don&#039;t edit or alter the already existing content. Lets try to get as many thoughts/ideas as possible and then we will edit and filter the redundancy later. And lets make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird because in last Wednesday&#039;s lab, the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to Ozzy, so I&#039;ve got free time now. I am reading it again to refresh my memory of it and will put down notes on what I think we can criticize about it and such. What kind of references do you think we will need? Similar papers etc?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca. And if that guy is still in our group but doesn&#039;t participate, too bad for him --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper. I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should get the background concepts done hopefully by tonight. If anyone wants to help with the other sections that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across what he thinks is a good or bad point, you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in the references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually the paper does talk about that. Look at the first two paragraphs in the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate, I think its because its not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided is excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need for and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here&#039;s my question: who&#039;s doing what in the group work, and where should I focus my attention to do my part? - Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted a few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need for and advantages of NV, the models, the trap and emulate model of x86 machines, computer paging which is discussed in the paper, and computer ring security which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far we should go here. Should we provide background on the hardware architecture used by the authors, like the x86 family and the VMX chips, or maybe some of the concepts discussed later on in the testing, such as optimization, emulation and para-virtualization?&lt;br /&gt;
&lt;br /&gt;
I will speak and consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that. From what I understood, they apply the same model (trap and emulate) but they provide optimizations and ways to increase the efficiency of trap calls between the nested environments, so that&#039;s definitely a contribution, but it&#039;s more of a performance-optimization kind of contribution I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper; the due date now is Dec 2nd.&#039;&#039;&#039; And that&#039;s really good, given that some of those concepts require time to formulate. I also asked the prof about the approach we should follow in terms of presenting the material, and he mentioned that we need to provide enough information in each section to make our fellow students understand what the paper is about without them having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details: if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in two small paragraphs or something.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah I am really thankful he left us with another week to do it.  I am sure we all have at least 3 projects due soon, other than this Essay.  I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow.  I was going to do it yesterday but he gave us an extension, so I slacked off a bit.  I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the background concepts here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration; as far as I know, we should be allowed to do this. If anyone has any suggestions about what I have posted or any counter-arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive amount of exits can somehow be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design somehow too complex or difficult to get a hold of, I&#039;m not sure about this, but just guessing from a general programming point of view. I will email the prof today, maybe he can give us some hints for what can be considered a weakness or a bad spot if you will in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version, then move it to the actual article. As for the critique section, how about we put a section on the talk page here and people can add in what they thought worked/didn&#039;t work with some explanation/references, and then we can get someone/some people to combine it and put it in the essay? &lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Yeah really, great work on the background. It&#039;s looking slick. I added some initial edits to the contribution and critique, but I agree, let&#039;s open a thread here and all collaborate. --[[User:Praubic|Praubic]] 18:24, 30 November 2010 (UTC)&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Nice man.  Sorry I haven&#039;t updated with anything that I have done yet, but I&#039;ll have  it up later today or tomorrow.  I got both an Essay and game dev project done for tomorrow, so after 1 I will be free to work on this until it is time for 3004--JSlonosky 13:41, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
I put up an initial version of the research problem section in the article. Let me know what you guys think. --[[User:Mbingham|Mbingham]] 19:53, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
Hey guys. Since I&#039;m working on the background concepts and Michael is handling the research problem, the other members should handle the contribution part. I think everything we need for the contribution section is in section 3 of the article (3.1, 3.2, 3.3, 3.4, 3.5). You can also make use of the things we posted here. Just to be on the safe side, we need to get this done by tomorrow night. I&#039;m working on a couple of definitions as we speak and will hopefully be done by tomorrow morning.&lt;br /&gt;
&lt;br /&gt;
PS: We should leave the critique to the end, there should not be a lot of writing for that part and we must all contribute.&lt;br /&gt;
&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 01:45, 1 December 2010 (UTC)&lt;br /&gt;
-----------------------------&lt;br /&gt;
Just posted other bits that were missing in the background concepts section, like the security uses, models of virtualization and para-virtualization. They&#039;re just a rough version, however; I will edit them in the next few hours. I just need to write something for protection rings and that would be it, I guess.&lt;br /&gt;
&lt;br /&gt;
I can help with the other sections for the rest of the day, I will try to post some summaries for performance and implementation or even the related work. --[[User:Hesperus|Hesperus]] 07:26, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Guys, we need to get moving here.. The contribution section still needs a lot. We need to talk about their innovations and the things they did there:&lt;br /&gt;
CPU virtualization, Memory virtualization, I/O virtualization and the Macro-optimizations.&lt;br /&gt;
&lt;br /&gt;
I will be posting something regarding this in the next few hours. --[[User:Hesperus|Hesperus]] 22:53, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
I have looked over the paper again and I am wondering about some things.  How are we to critique it?  By their methods, or by the paper itself?&lt;br /&gt;
I find that in the organization of the paper, they give you the links and extra information to look more in depth on such things like the VMC technology, but they almost use that as an excuse for not explaining things in the paper.&lt;br /&gt;
The VMC(0 -&amp;gt;1) annotation isn&#039;t explained. I understand what they mean, but it seems that they assume that you already know some things. --JSlonosky 03:03, 2 December 2010 (UTC)&lt;br /&gt;
-----------------&lt;br /&gt;
I think most research papers follow that kind of approach, they vaguely talk about the sideline things and provide references. The VMC technology from what I understood is just a creation of an environment to link or switch between hypervisors. --[[User:Hesperus|Hesperus]] 03:26, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
The instructions say that both style and content can be critiqued. I guess the organization of the paper would fall under style, but I&#039;m not sure how fair it is to critique how much they go in depth on certain things, especially some background stuff. After all, the audience of this paper is people who are already well versed in OS and virtualization stuff. That&#039;s not to say that we shouldn&#039;t bring it up, especially if we feel they don&#039;t sufficiently explain a new technique or notation they are using. &lt;br /&gt;
&lt;br /&gt;
I think it&#039;s also important to remember that our critique will contain things they have done well, not just things they could have done better. Considering that this paper got the best paper award at the largest OS conference, I think it&#039;s safe to say our critique will have many more good things than bad.&lt;br /&gt;
&lt;br /&gt;
Here&#039;s some things they have done well on first inspection, just to get some ideas out there:&lt;br /&gt;
* Solution is extensible to an arbitrary nesting depth without major loss of performance&lt;br /&gt;
* Solution doesn&#039;t depend on modified hardware or software (except for the lowest-level hypervisor); we can reference previous solutions that do require modifications&lt;br /&gt;
* The paper doesn&#039;t ignore virtualizing I/O devices to an arbitrary nesting depth, other techniques do&lt;br /&gt;
* I think the paper does well in laying out the theoretical approach to the problem, as well as demonstrating impressive empirical results.&lt;br /&gt;
&lt;br /&gt;
I&#039;ll have some time to work on this tomorrow, probably clean up the research problem section, maybe kick off the contribution section if no one&#039;s started it, and put up some more extensive stuff for the critique. Let me know what you guys think, i&#039;m off to bed pretty soon, haha! --[[User:Mbingham|Mbingham]] 03:41, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
Okay, thanks for the clear up man. Sounds good.  I&#039;ll see what else I can do in between other work I got to do tonight.&lt;br /&gt;
One thing we should remember is to make sure that our essay clearly answers the question that is directed to it on the exam review.  If we get some other good ideas for questions, we should submit those to Anil as well.&lt;br /&gt;
Questions 1 and 2 relate to our essay, in my mind.&lt;br /&gt;
&amp;quot;What are two uses for nested virtual machines?&lt;br /&gt;
Multi-dimensional page tables are designed to avoid using shadow page tables in nested virtualization. What are shadow page tables, and when must they be used?&amp;quot;&lt;br /&gt;
--JSlonosky 04:47, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Hey guys. The points that Michael mentioned sound pretty great. I think the critique more or less depends on our understanding of the paper, so it&#039;s not like there&#039;s a specific answer or something.&lt;br /&gt;
I will also be seeing the prof tomorrow in his office hours if anyone wants to join me, I will post something here before I go.&lt;br /&gt;
&lt;br /&gt;
The backgrounds section is done. I will keep editing it and filter some of the information. I don&#039;t have a lot of things to do today, so I will spend the whole day working on the paper and editing it and adding the references. I added some sub-sections for the contributions section. The theory part should just talk about the way they&#039;re flattening the levels of virtualization and multiplexing the hardware, I will try to write something for this. Then we go into the CPU, Memory, I/O and optimization. And I can see that someone already handled those things here in the discussion. So we&#039;re pretty much done. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PS: Guys, please don&#039;t forget about the references. We don&#039;t wanna get into any trouble with the prof in that regard.&#039;&#039;&#039;&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 08:51, 2 December 2010 (UTC)&lt;br /&gt;
------------------&lt;br /&gt;
Alright, I will do some of the contribution section today or tonight, so no worries. As for the critique, as I said, I added some stuff there, but we still need to debate the good and bad of the design as perceived by our opinions; since it&#039;s a critique we can use the first person, &amp;quot;I&amp;quot; and &amp;quot;To me&amp;quot;. --[[User:Praubic|Praubic]] 15:37, 2 December 2010 (UTC)&lt;br /&gt;
-------------------&lt;br /&gt;
Also, how about each of us contributes to the critique part (here in the discussion) in point form, and then we glue it together into concise sentences? We have to get straight to the point. We are not aiming for length but rather content, as you all know obviously. --[[User:Praubic|Praubic]] 15:53, 2 December 2010 (UTC)&lt;br /&gt;
--------------------&lt;br /&gt;
Actually, the contributions section is outlined below in the implementation part here on the discussion page, so whoever did that should edit it and take it to the main page. I&#039;m going to the office hours two hours from now to ask the prof a couple of things, including the critique. --[[User:Hesperus|Hesperus]] 15:58, 2 December 2010 (UTC)&lt;br /&gt;
--------------------&lt;br /&gt;
I was just looking over the background concepts section, and had a couple of questions. Firstly, would it be possible to maybe scale the image down and have the text flow around it? Right now it seems to break the &amp;quot;flow&amp;quot; a bit, if that makes sense. Secondly, I think maybe we should think about consolidating some of the sub headings and stuff, I think it breaks the flow of the paper if we have a whole bunch of sub headings that only have a couple of sentences of explanation. Also, I added some stuff to the critique section on the talk page here (right at the bottom). I&#039;ll add some more later. Let me know what you guys think, and let us know how the meeting with Anil goes Hesperus. If I have time I&#039;ll try to come, but i&#039;ve got two other projects on the go right now too, haha. --[[User:Mbingham|Mbingham]] 16:56, 2 December 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
Honestly, I don&#039;t know how to scale down the picture and make the text flow around it, but I will try later on tonight to resize it and make it smaller. Regarding the headings, yeah, I can do that. I got sort of caught up with a lot of the terms and categorizations. I was even thinking about taking out the multiple-hardware-support model, because it&#039;s only briefly mentioned in the paper and it&#039;s not even available on x86 machines. I will ask the prof about those things; I will be seeing him in 30-40 minutes from now, his office hours start at 1:00 pm. Also, if you guys notice any typos or misspellings, don&#039;t worry, I will be editing the whole thing tonight. --[[User:Hesperus|Hesperus]] 17:36, 2 December 2010 (UTC)&lt;br /&gt;
--------------------------&lt;br /&gt;
Guys.. whoever did the implementation section below which is basically the contribution, should try to edit it and take it to the main page, I have already provided the headings for the contribution in the main page. I&#039;m currently working on the theory bit in that very same section. --[[User:Hesperus|Hesperus]] 17:43, 2 December 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&lt;br /&gt;
That was Csullvia.  I will go ahead and do it for him if he can&#039;t and has something else to do.--JSlonosky 18:32, 2 December 2010 (UTC)&lt;br /&gt;
-------&lt;br /&gt;
Ok, I didn&#039;t want to edit it myself, because I don&#039;t want to sound repetitive or redundant in my style. The prof should be locking the wiki sometime tomorrow at 7:00 am or 8:00 am, so we better get this finished tonight by 12 or something.&lt;br /&gt;
&lt;br /&gt;
I went and spoke with the prof in his office an hour ago. Regarding the critique, he pointed out a few things that I will be working on in the next few hours, like the complexity of their design and whether it would remain efficient when applying multiple levels of virtualization. So I will write something on that; maybe we can combine our points into one paragraph or something.&lt;br /&gt;
&lt;br /&gt;
The headings, he said are fine. But he did mention that the article should make sense or be readable if we remove the headings or the section titles. I will be watching the discussion page frequently for comments and discussion. --[[User:Hesperus|Hesperus]] 20:10, 2 December 2010 (UTC)&lt;br /&gt;
--------&lt;br /&gt;
If you don&#039;t see any update until late at night, don&#039;t worry, I&#039;m coming back to do one final edit and grammar check for the whole article. &lt;br /&gt;
&#039;&#039;&#039;But please guys, if you have used any resources, then don&#039;t forget to add them&#039;&#039;&#039;. --[[User:Hesperus|Hesperus]] 21:31, 2 December 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
Cool. I see that Chris has added the contributions to the main page. I&#039;m currently adding the resources and will be adding a few other things later. --[[User:Hesperus|Hesperus]] 23:33, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Nice, nice I&#039;m currently working on Critique section. Anticipate updates, modify at will. --[[User:Praubic|Praubic]] 23:40, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
I&#039;m just writing out the good copy of another assignment, I should be done in about an hour and can work on whatever needs working on. --[[User:Mbingham|Mbingham]] 23:55, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation, which includes a guest hypervisor and a virtualized environment, gives the guest virtual machine the illusion that it is running directly on the main hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), the hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guest virtual machines with one another, with the host hardware and with the host operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the bare-metal hypervisor (L0) runs a VM called L1; in turn, L1 runs another VM, L2; L2 then runs L3, and so on.&lt;br /&gt;
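Because x86 hardware supports only a single hypervisor level, every trap taken at any nesting depth lands at the bare-metal hypervisor first, which must then forward it toward the hypervisor that should logically handle it. Here is a toy sketch of that forwarding (the level names and the handling rule are simplified assumptions for illustration, not the paper&#039;s implementation):&lt;br /&gt;

```python
# Toy model: levels[0] is the bare-metal hypervisor; each deeper entry is a
# guest hypervisor or VM running inside the one before it.
levels = ["L0", "L1", "L2", "L3"]

def deliver_trap(trapping_level):
    """All traps reach L0 first; L0 forwards toward the trapper's parent."""
    route = ["hardware trap arrives at L0"]
    parent = trapping_level - 1  # the logical handler is the level just below
    for hop in range(1, parent + 1):
        route.append("L0 forwards trap state to " + levels[hop])
    route.append(levels[parent] + " handles the trap for " + levels[trapping_level])
    return route

for step in deliver_trap(3):
    print(step)
```

Each extra level adds more forwarding hops per trap, which is one intuitive way to see why exits multiply with nesting depth.&lt;br /&gt;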
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute or access privileged hardware context, it triggers a trap or a fault which gets caught by the host hypervisor. The host hypervisor then determines whether the instruction should be allowed to execute and, based on that, provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
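As a rough illustration (the instruction names and the handler behaviour below are invented for this sketch, not taken from the paper), the model can be pictured as a dispatch loop in which unprivileged instructions run directly while privileged ones trap to the host hypervisor for emulation:&lt;br /&gt;

```python
# Invented instruction names; real hardware traps on far more conditions.
PRIVILEGED = {"write_cr3", "out", "hlt"}

def host_trap_handler(insn):
    """Emulate the privileged instruction against virtual state."""
    if insn == "hlt":
        return "hlt virtualized: guest descheduled, real CPU keeps running"
    return "emulated " + insn + " against virtual hardware state"

def run_guest(instructions):
    log = []
    for insn in instructions:
        if insn in PRIVILEGED:
            log.append(host_trap_handler(insn))  # fault caught by host hypervisor
        else:
            log.append("ran " + insn + " directly on the CPU")
    return log

for step in run_guest(["add", "write_cr3", "mov", "hlt"]):
    print(step)
```

The guest never notices the difference: it only ever sees the emulated outcome, which is what preserves the illusion described above.&lt;br /&gt;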
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this is the Windows XP mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IAAS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IAAS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMWare and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it becomes corrupted or damaged it can easily be removed, recreated or even restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. The approach is software-based, meaning that the underlying architecture is not really altered, and this is the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, yet the authors were able to achieve it in software.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors ultimately comes down to efficiency. For example, virtualization on servers has been rapidly gaining popularity; the next evolutionary step is to extend single-level memory-management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the rapidly developing field of cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A cloud user can manage his own virtual machines directly through a hypervisor of his choice. In addition, nested virtualization provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper in which he talks about more recent research work on virtualization; in particular, in his first paragraph he refers to some more recent research by the VMWare technical support team. He also discusses some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
There are two models for implementing nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the hypervisors running directly on top of it. For instance, suppose L0 (the host hypervisor) runs L1. If L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 handles the trapping and emulation for it.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: this is the model supported by x86 machines, and it is tied to the concept of &amp;quot;trap and emulate&amp;quot;. Every hypervisor emulates the underlying hardware (the VMX extensions in the paper&#039;s implementation) and presents a virtual platform to the hypervisor running on top of it (the guest hypervisor), letting it believe that it is running on the actual hardware. When the guest hypervisor attempts an operation that requires hardware-level privileges, it causes a trap or fault; the trap is caught by the main host hypervisor and inspected to see whether it is a legitimate and appropriate request. If it is, the host emulates the operation for the guest, again letting it believe that it is actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, every trap must go back to the main host hypervisor, which then forwards the trap and virtualization state to the level responsible for handling it. For instance, suppose L0 runs L1, and L1 attempts to run L2: the command to run L2 traps down to L0, and L0 then forwards it back up to L1. This is the model we&#039;re interested in because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
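A minimal sketch of this forwarding rule (illustrative names only): no matter which level traps, the hardware always exits to L0, which then forwards the trap up to the hypervisor responsible for the trapping guest:&lt;br /&gt;

```python
# Illustrative sketch only: whichever level causes a trap, the hardware
# always exits to L0 first; L0 then forwards the trap up one hypervisor
# at a time until it reaches the one responsible for the trapping guest.

def trap_path(responsible_level):
    """Return the hypervisor levels a trap visits, in order."""
    return list(range(responsible_level + 1))

# An instruction in L2 traps; its own hypervisor L1 must handle it,
# so the trap travels L0, then L1, rather than reaching L1 directly.
path = trap_path(responsible_level=1)
```

With deeper nesting the same trap visits every level below the handler, which is exactly why deep nesting multiplies exits in this model.&lt;br /&gt;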
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles project has four components that are crucial to its implementation:&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-Optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How nested VMX virtualization works:&lt;br /&gt;
L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (VMCS: virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed to the CPU when the VM is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. The vmlaunch instruction traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine and only L0 occupies the architectural hypervisor mode. To multiplex the hardware so that L2 appears to run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 subsequently causes a trap, L0 either handles it itself or forwards it to L1, depending on whether it is the responsibility of L1&#039;s virtual machine to handle it.&lt;br /&gt;
To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts. This would not normally be a problem, but because L1 is running in guest mode as a virtual machine, each of these privileged operations traps, so a single high-level L2 exit (or L3 exit) causes many exits to L0, and more exits mean less performance. This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously.&lt;br /&gt;
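The VMCS merge described above (VMCS0-&amp;gt;1 combined with VMCS1-&amp;gt;2 to form VMCS0-&amp;gt;2) can be sketched by treating each VMCS as a small dictionary. The field names here are invented for illustration; a real VMCS holds far more state:&lt;br /&gt;

```python
# Hypothetical sketch of the VMCS merge, treating each VMCS as a dict
# with guest-state, host-state and exit-control fields. Field names are
# invented; real VMCS layouts are far larger and processor-defined.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Build VMCS0to2 so that L0 can run L2 directly."""
    return {
        # L2 register state comes from what L1 specified for its guest.
        "guest_state": dict(vmcs_1_2["guest_state"]),
        # On exit, control must return to L0, not L1, so the host state
        # of VMCS0to1 is used.
        "host_state": dict(vmcs_0_1["host_state"]),
        # Controls are merged: trap on anything either level intercepts.
        "exit_controls": sorted(set(vmcs_0_1["exit_controls"])
                                | set(vmcs_1_2["exit_controls"])),
    }

vmcs_0_1 = {"guest_state": {"rip": 0x1000},
            "host_state": {"rip": 0xF000},
            "exit_controls": ["cpuid", "io"]}
vmcs_1_2 = {"guest_state": {"rip": 0x2000},
            "host_state": {"rip": 0x1000},
            "exit_controls": ["cpuid", "cr3_write"]}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
```

The design point this illustrates: the merged structure describes L2 as a direct guest of L0, while exits still return control to L0 so it can decide whether to forward them to L1.&lt;br /&gt;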
&lt;br /&gt;
How multi-dimensional paging works:&lt;br /&gt;
With n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables: the regular page table (virtual to guest physical) and the EPT (guest physical to host physical). The three translations are therefore compressed onto the two available tables, going from start to end in two hops instead of three. This is done with shadow page tables for the virtual machine, i.e. shadow-on-EPT, which compresses the three logical translations into two tables. The key observation is that the EPT tables rarely change, while the guest page tables change frequently. L0 emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
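The compression of two translation tables into one can be sketched as a simple composition of page maps, with dictionaries standing in for page tables and made-up page addresses:&lt;br /&gt;

```python
# Illustrative sketch: given EPT0to1 (L1-physical to L0-physical) and
# EPT1to2 (L2-physical to L1-physical), build EPT0to2 so an L2 physical
# address reaches an L0 physical address in one lookup. Page tables are
# simplified to dicts; addresses are invented.

def compose(ept_1_2, ept_0_1):
    """EPT0to2[p] = EPT0to1[EPT1to2[p]] for every mapped L2 page p."""
    return {l2_page: ept_0_1[l1_page]
            for l2_page, l1_page in ept_1_2.items()
            if l1_page in ept_0_1}

ept_0_1 = {0xA000: 0x1000, 0xB000: 0x2000}   # L1 phys to L0 phys
ept_1_2 = {0x4000: 0xA000, 0x5000: 0xB000}   # L2 phys to L1 phys
ept_0_2 = compose(ept_1_2, ept_0_1)
```

The composition has to be redone whenever either input table changes, which is why it matters that the EPT side rarely changes.&lt;br /&gt;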
&lt;br /&gt;
How I/O virtualization works:&lt;br /&gt;
There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, in which the guest runs a driver that knows it is virtualized (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance, the authors used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization, but they chose multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices and bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. For DMA, each hypervisor (L0 and L1) needs an IOMMU to let its virtual machines access the device safely, but there is only one platform IOMMU, so L0 emulates an IOMMU for L1. L0 then compresses the multiple IOMMU translation tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device can DMA into L2&#039;s memory space directly.&lt;br /&gt;
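The IOMMU compression is the same composition idea applied to DMA translations. A hypothetical sketch, with invented names and addresses:&lt;br /&gt;

```python
# Sketch (invented names): L0 folds the emulated IOMMU table it keeps
# for L1 together with the table L1 keeps for L2 into the single
# hardware IOMMU table, so a device DMA using an L2 address lands
# directly in the right L0 page without any hypervisor involvement.

def build_hw_iommu(iommu_l1_for_l2, iommu_l0_for_l1):
    """Compose the two per-hypervisor IOMMU maps into one hardware map."""
    return {dma_addr: iommu_l0_for_l1[l1_addr]
            for dma_addr, l1_addr in iommu_l1_for_l2.items()}

def device_dma(hw_iommu, memory, dma_addr, value):
    """The device writes through the single hardware IOMMU table."""
    memory[hw_iommu[dma_addr]] = value

iommu_l0_for_l1 = {0xA000: 0x1000}      # L1 phys to L0 phys
iommu_l1_for_l2 = {0x4000: 0xA000}      # L2 DMA addr to L1 phys
hw = build_hw_iommu(iommu_l1_for_l2, iommu_l0_for_l1)
ram = {}
device_dma(hw, ram, 0x4000, "packet")   # lands in L2 memory, no exits
```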
&lt;br /&gt;
&lt;br /&gt;
How the micro-optimizations to make it go faster were implemented:&lt;br /&gt;
The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. First, the transitions between L1 and L2 were optimized. Each such transition involves an exit to L0 followed by an entry, and in L0 most of the time is spent merging VMCSs, so the merge was optimized by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling was optimized further by copying multiple VMCS fields at once: normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time, but on the processors they tested, VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on other processors). Second, the main cause of the slow exit handling is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to read and change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
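A toy model (not real VMX code, counts are illustrative) of why the bulk copy helps: each emulated single-field vmread traps to L0, while one large copy of the whole VMCS region traps not at all:&lt;br /&gt;

```python
# Illustrative model only: L1 handles one L2 exit and needs to touch
# several VMCS fields. Under trap and emulate, each vmread/vmwrite is
# privileged and exits to L0; the optimization replaces them with one
# large memory copy of the VMCS region, eliminating those exits.

class L0:
    def __init__(self):
        self.exits = 0

    def emulate_vmread(self, vmcs, field):
        self.exits += 1                  # each single-field access traps
        return vmcs[field]

    def bulk_copy(self, vmcs):
        # One big memcpy-style copy: no trap per field, on processors
        # where this shortcut is known to be safe.
        return dict(vmcs)

vmcs = {"field%d" % i: i for i in range(25)}
l0 = L0()
slow = {f: l0.emulate_vmread(vmcs, f) for f in vmcs}
exits_slow = l0.exits                    # one exit per field
l0.exits = 0
fast = l0.bulk_copy(vmcs)
exits_fast = l0.exits                    # zero exits
```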
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
Overhead for nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb. &lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the exit-handling code running in a guest hypervisor such as L1 is much slower than the same code running in L0.&lt;br /&gt;
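As a back-of-the-envelope illustration only (the paper reports measured overhead at the L2 level, not a per-level model), a fixed overhead per nesting level would compound multiplicatively:&lt;br /&gt;

```python
# Back-of-the-envelope only: if each extra nesting level multiplied
# runtime by roughly 1.06 to 1.10, deeper nesting would compound
# like this. This is a hypothetical model, not a result from the paper.

def compounded_slowdown(per_level_overhead, levels):
    """Total slowdown factor after `levels` levels of nesting."""
    return (1 + per_level_overhead) ** levels

low = compounded_slowdown(0.06, 3)    # about 1.19x after three levels
high = compounded_slowdown(0.10, 3)   # about 1.33x after three levels
```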
&lt;br /&gt;
The paper outlines the optimization steps taken to achieve minimal overhead.&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and directly accessing VMCS data under certain conditions, removing the need to trap and emulate.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit-handling code (the main cause of the slowdown is the additional exits triggered by privileged instructions in the exit-handling code).&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
--&lt;br /&gt;
The paper unequivocally demonstrates a strong contribution in the area of virtualization and data sharing within a single machine. It is aimed at programmers and does not affect the end user with any clearly detectable deviation in the usage of applications on top of this architecture. Nevertheless, the contribution is visible with respect to security and compatibility. Since this is the first successful implementation of this type that does not modify hardware (there have been only partially efficient designs), we expect to see increased interest in the nested virtualization model described above.--[[User:Praubic|Praubic]] 23:37, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
The framework makes for convenient testing and debugging, due to the fact that hypervisors can function inconspicuously beneath other nested hypervisors and VMs without being detected. Moreover, the performance overhead is reduced to 6-10% per level thanks to optimizations such as omitted vmwrites and direct paging (the multi-dimensional paging technique). &lt;br /&gt;
&lt;br /&gt;
--&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. They also won the Jay Lepreau best paper award. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: They do mention a masters thesis by Berghmans (it is citation 12 in the paper) that, if I understand it right, also covers software-only nested virtualization (they mention it in section 2 as well as in the video), but they claim it is inefficient because only the lowest-level hypervisor is able to take advantage of hardware with virtualization support. In the Turtles project solution, all levels of hypervisors can take advantage of any present virtualization support. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
* Writing, organization wise: They provide links and resources that can help give explanations to the concepts that they briefly touch upon&lt;br /&gt;
&lt;br /&gt;
* Relatively low performance cost for each level. As mentioned in the video, the team successfully achieved a 6 to 10% performance overhead for each nesting level.&lt;br /&gt;
&lt;br /&gt;
* Thanks to several optimizations, performance is greatly improved to an acceptable level:&lt;br /&gt;
         - Bypassing vmread and vmwrite instructions and directly accessing data under certain conditions&lt;br /&gt;
         - Optimizing exit handling code and consequently reducing number of exits.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* A large number of exits causes a significant performance cost.&lt;br /&gt;
&lt;br /&gt;
* Writing, organization wise: some concepts, such as the VMCSs, are written assuming that you are already familiar with how they work or have read the appropriate references for that section of the research project.&lt;br /&gt;
&lt;br /&gt;
* From quickly looking over their results section, it seems their tests are done at the L2 level, a guest with two hypervisors below it. I think it might have been useful to understand the limits of nesting if they did some tests at an even higher level of nesting, L4 or L5 or whatever, just to see what the effect is. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6406</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6406"/>
		<updated>2010-12-02T16:56:10Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* General discussion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please,&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info in my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he is still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name or even email me, you can find my contact info in my profile page(just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtle Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add but don&#039;t edit or alter the already existing content. Lets try to get as many thoughts/ideas as possible and then we will edit and filter the redundancy later. And lets make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird because at last Wednesday&#039;s lab, the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to ozzy so I got free time now. I am reading it again to refresh my memory of it and will put notes of what I think we can criticize about it and such. What kind of references do you think we will need?  Similar papers etc?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca. And if he is still in our group but doesn&#039;t participate, too bad for him --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper; I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should get the background concepts done hopefully by tonight. If anyone wants to help with the other sections that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across what he thinks is a good or bad point, you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually the paper does talk about that. Look at the first two paragraphs in the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate, I think its because its not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided is excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here&#039;s my question: who&#039;s doing what in the group work, and where should I focus my attention to do my part? - Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted a few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need for and advantages of NV, the models, the trap and emulate model of x86 machines, computer paging, which is discussed in the paper, and computer ring security, which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far should we go here ? should we provide background on the hardware architecture used by the authors like the x86 family and the VMX chips, or maybe some of the concepts discussed later on in the testing such as optimization, emulation, para-virtualization ?&lt;br /&gt;
&lt;br /&gt;
I will speak and consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that, from what I understood, they apply the same model (the trap and emulate) but they provide optimizations and ways to increase the trap calls efficiency between the nested environments, so thats definitely a contribution, but its more of a performance optimization kind of contribution I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper; the due date is now Dec 2nd.&#039;&#039;&#039; And that&#039;s really good, given that some of those concepts require time to formulate. I also asked the prof about the approach we should follow in terms of presenting the material, and he mentioned that you need to provide enough information in each section to let your fellow students understand what the paper is about without having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details: if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in two small paragraphs or so.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah I am really thankful he left us with another week to do it.  I am sure we all have at least 3 projects due soon, other than this Essay.  I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow.  I was going to do it yesterday but he gave us an extension, so I slacked off a bit.  I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the background concepts here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration; as far as I know, we should be allowed to do this. If anyone has any suggestions about what I have posted, or any counter-arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive number of exits can be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design too complex or difficult to get a hold of. I&#039;m not sure about this, just guessing from a general programming point of view. I will email the prof today; maybe he can give us some hints about what can be considered a weakness, or a bad spot if you will, in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version, then move it to the actual article. As for the critique section, how about we put a section on the talk page here where people can add what they thought worked/didn&#039;t work with some explanation/references, and then we get someone/some people to combine it and put it in the essay? &lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Yeah really, great work on the background. It&#039;s looking slick. I added some initial edits in the Contribution and Critique, but I agree, let&#039;s open a thread here and all collaborate. --[[User:Praubic|Praubic]] 18:24, 30 November 2010 (UTC)&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Nice man.  Sorry I haven&#039;t updated with anything that I have done yet, but I&#039;ll have it up later today or tomorrow.  I have both an essay and a game dev project due tomorrow, so after 1 I will be free to work on this until it is time for 3004. --JSlonosky 13:41, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
I put up an initial version of the research problem section in the article. Let me know what you guys think. --[[User:Mbingham|Mbingham]] 19:53, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
Hey guys. Since I&#039;m working on the background concepts and Michael is handling the research problem, the other members should handle the contribution part. I think everything we need for the contribution section is in section 3 of the article (3.1, 3.2, 3.3, 3.4, 3.5). You can also make use of the things we posted here. Just to be on the safe side, we need to get this done by tomorrow night. I&#039;m working on a couple of definitions as we speak and will hopefully be done by tomorrow morning.&lt;br /&gt;
&lt;br /&gt;
PS: We should leave the critique to the end, there should not be a lot of writing for that part and we must all contribute.&lt;br /&gt;
&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 01:45, 1 December 2010 (UTC)&lt;br /&gt;
-----------------------------&lt;br /&gt;
Just posted other bits that were missing in the background concepts section, like the security uses and the models of virtualization and para-virtualization. They&#039;re just rough versions, however; I will edit them in the next few hours. I just need to write something for protection rings and that would be it, I guess.&lt;br /&gt;
&lt;br /&gt;
I can help with the other sections for the rest of the day, I will try to post some summaries for performance and implementation or even the related work. --[[User:Hesperus|Hesperus]] 07:26, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Guys, we need to get moving here. The contribution section still needs a lot. We need to talk about their innovations and what they did with:&lt;br /&gt;
CPU virtualization, memory virtualization, I/O virtualization and the micro-optimizations.&lt;br /&gt;
&lt;br /&gt;
I will be posting something regarding this in the next few hours. --[[User:Hesperus|Hesperus]] 22:53, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
I have looked over the paper again and I am wondering about some things.  How are we to critique it?  By their methods, or by the paper itself?&lt;br /&gt;
I find that in the organization of the paper, they give you links and extra information to look more in depth at things like the VMCS mechanism, but they almost use that as an excuse for not explaining things in the paper.&lt;br /&gt;
For example, the VMCS(0 -&amp;gt; 1) notation isn&#039;t explained.  I understand what they mean, but it seems that they assume you already know some things. --JSlonosky 03:03, 2 December 2010 (UTC)&lt;br /&gt;
-----------------&lt;br /&gt;
I think most research papers follow that kind of approach: they talk only briefly about the sideline things and provide references. The VMCS, from what I understood, is just a structure that creates an environment to link or switch between hypervisors. --[[User:Hesperus|Hesperus]] 03:26, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
The instructions say that both style and content can be critiqued. I guess the organization of the paper would fall under style, but I&#039;m not sure how fair it is to critique how much they go in depth on certain things, especially some background stuff. After all, the audience of this paper is people who are already well versed in OS and virtualization topics. That&#039;s not to say we shouldn&#039;t bring it up, especially if we feel they don&#039;t sufficiently explain a new technique or notation they are using. &lt;br /&gt;
&lt;br /&gt;
I think it&#039;s also important to remember that our critique will contain things they have done well, not just things they could have done better. Considering that this paper got the best paper award at the largest OS conference, I think it&#039;s safe to say our critique will have many more good things than bad.&lt;br /&gt;
&lt;br /&gt;
Here&#039;s some things they have done well on first inspection, just to get some ideas out there:&lt;br /&gt;
* Solution is extensible to an arbitrary nesting depth without major loss of performance&lt;br /&gt;
* Solution doesn&#039;t depend on modified hardware or software (except for the lowest-level hypervisor); we can reference previous solutions that do require modifications&lt;br /&gt;
* The paper doesn&#039;t ignore virtualizing I/O devices to an arbitrary nesting depth, as other techniques do&lt;br /&gt;
* I think the paper does well in laying out the theoretical approach to the problem, as well as demonstrating impressive empirical results.&lt;br /&gt;
&lt;br /&gt;
I&#039;ll have some time to work on this tomorrow: probably clean up the research problem section, maybe kick off the contribution section if no one&#039;s started it, and put up some more extensive stuff for the critique. Let me know what you guys think, I&#039;m off to bed pretty soon, haha! --[[User:Mbingham|Mbingham]] 03:41, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
Okay, thanks for clearing that up man. Sounds good.  I&#039;ll see what else I can do in between the other work I have to do tonight.&lt;br /&gt;
One thing we should remember is to make sure that our essay clearly answers the question directed at it on the exam review.  If we get some other good ideas for questions, we should submit those to Anil as well.&lt;br /&gt;
Questions 1 and 2 relate to our essay, in my mind:&lt;br /&gt;
&amp;quot;What are two uses for nested virtual machines?&lt;br /&gt;
Multi-dimensional page tables are designed to avoid using shadow page tables in nested virtualization. What are shadow page tables, and when must they be used?&amp;quot;&lt;br /&gt;
--JSlonosky 04:47, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Hey guys. The points that Michael mentioned sound pretty great. I think the critique more or less depends on our understanding of the paper, so it&#039;s not like there&#039;s a specific answer or something.&lt;br /&gt;
I will also be seeing the prof tomorrow in his office hours; if anyone wants to join me, I will post something here before I go.&lt;br /&gt;
&lt;br /&gt;
The background section is done. I will keep editing it and filtering some of the information. I don&#039;t have a lot to do today, so I will spend the whole day working on the paper, editing it and adding the references. I added some sub-sections to the contributions section. The theory part should just talk about the way they&#039;re flattening the levels of virtualization and multiplexing the hardware; I will try to write something for this. Then we go into the CPU, memory, I/O and optimizations. And I can see that someone already handled those things here in the discussion, so we&#039;re pretty much done. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PS: Guys, please don&#039;t forget about the references. We don&#039;t wanna get into any trouble with the prof in that regard.&#039;&#039;&#039;&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 08:51, 2 December 2010 (UTC)&lt;br /&gt;
------------------&lt;br /&gt;
Alright, I will do some of the contribution section today or tonight, so no worries. As for the critique, as I said I added some stuff there, but we still need to debate the good and bad of the design as we perceive it; since it&#039;s a critique, we can use the first person: &amp;quot;I&amp;quot; and &amp;quot;To me&amp;quot;. --[[User:Praubic|Praubic]] 15:37, 2 December 2010 (UTC)&lt;br /&gt;
-------------------&lt;br /&gt;
Also, how about each of us contributes to the critique part (here in the discussion) in point form, and then we glue it together into concise sentences? We have to get straight to the point. We are not aiming for length but for content, as you all know obviously. --[[User:Praubic|Praubic]] 15:53, 2 December 2010 (UTC)&lt;br /&gt;
--------------------&lt;br /&gt;
Actually, the contributions section is outlined below in the implementation part here on the discussion page. So whoever did that should edit it and take it to the main page. I&#039;m going to the office hours in 2 hours from now to ask the prof a couple of things, including the critique. --[[User:Hesperus|Hesperus]] 15:58, 2 December 2010 (UTC)&lt;br /&gt;
--------------------&lt;br /&gt;
I was just looking over the background concepts section, and had a couple of questions. Firstly, would it be possible to scale the image down and have the text flow around it? Right now it seems to break the &amp;quot;flow&amp;quot; a bit, if that makes sense. Secondly, I think we should consider consolidating some of the sub-headings; it breaks the flow of the paper if we have a whole bunch of sub-headings that only have a couple of sentences of explanation. Also, I added some stuff to the critique section on the talk page here (right at the bottom); I&#039;ll add some more later. Let me know what you guys think, and let us know how the meeting with Anil goes, Hesperus. If I have time I&#039;ll try to come, but I&#039;ve got two other projects on the go right now too, haha. --[[User:Mbingham|Mbingham]] 16:56, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation, which includes a guest hypervisor and a virtualized environment, gives the guest the illusion that it&#039;s running directly on the main hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), the hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guest virtual machines with one another and with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the host hypervisor (L0) runs a VM called L1; in turn, L1 runs another VM, L2; L2 then runs L3, and so on.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute a privileged instruction or access privileged hardware context, it triggers a trap or fault which is caught by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute. Based on that, the host hypervisor provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
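The trap-and-emulate loop can be sketched in a few lines of Python. This is a toy illustration only: the instruction names and the shadow state below are invented for the example, not taken from the paper or any real instruction set.&lt;br /&gt;

```python
# Toy sketch of the trap-and-emulate model. Illustrative only: the
# instruction names and the shadow state are invented, not from the
# paper or any real ISA.

PRIVILEGED = {"read_cr3", "write_cr3"}   # pretend privileged instructions

class Hypervisor:
    def __init__(self):
        # Emulated privileged state the guest is allowed to see.
        self.shadow = {"cr3": 0}
        self.traps = 0

    def run_guest(self, program):
        """Run guest instructions; privileged ones trap to the hypervisor."""
        results = []
        for insn, arg in program:
            if insn in PRIVILEGED:
                self.traps += 1
                results.append(self.emulate(insn, arg))  # trap and emulate
            else:
                results.append(None)  # unprivileged: runs directly on hardware
        return results

    def emulate(self, insn, arg):
        """Emulate the trapped instruction against shadow state only."""
        if insn == "write_cr3":
            self.shadow["cr3"] = arg      # never touches the real register
            return None
        if insn == "read_cr3":
            return self.shadow["cr3"]

hv = Hypervisor()
out = hv.run_guest([("add", 1), ("write_cr3", 0x1000), ("read_cr3", None)])
assert hv.traps == 2 and out[2] == 0x1000
```

The point of the sketch is that the guest&#039;s privileged operations never reach the real hardware: each one traps, and the hypervisor applies it to emulated state instead.&lt;br /&gt;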
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this would be the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer is free to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if corrupted or damaged it can easily be removed, recreated or even restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning they do not alter the underlying architecture. This is the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, yet the authors were able to achieve it anyway.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the quickly developing cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The cloud user manages their own virtual machine directly through a hypervisor of choice. In addition, nesting provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper where he talks about more recent research work on virtualization, particularly in his first paragraph, he refers to some more recent research by the VMWare technical support team. He also talks about some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
There are two models for implementing nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the hypervisors running on top of it. For instance, suppose L0 (the host hypervisor) runs L1. If L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 will do the trap handling.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: This is the model supported by x86 machines, and it is tied to the concept of &amp;quot;trap and emulate&amp;quot;. Every hypervisor emulates the underlying hardware (the VMX extensions in the paper&#039;s implementation) and presents it to the hypervisor running on top of it (the guest hypervisor), letting the guest think it is running on the actual hardware. When a guest hypervisor tries to perform an operation requiring hardware-level privileges, it triggers a fault or trap. The trap is caught by the host hypervisor and inspected to see if it is a legitimate request; if it is, the host emulates the privileged operation for the guest, again letting it think it is running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, every trap must go back to the main host hypervisor first. The host hypervisor then forwards the trap and virtualization state to the hypervisor that is responsible for handling it. For instance, if L0 runs L1, and L1 attempts to run L2, then the command to run L2 goes down to L0, and L0 forwards it back up to L1. This is the model we&#039;re interested in, because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
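The forwarding rule in the single-level model can be sketched as follows. This is a toy illustration, not the paper&#039;s code: the function and the log strings are invented; the only idea taken from the paper is that every trap lands at L0 first and is then handed to the responsible (parent) hypervisor.&lt;br /&gt;

```python
# Toy sketch of single-level trap forwarding (illustrative; the names
# and log strings are invented). In the single-level x86 model every
# trap lands at L0 first; L0 then forwards it to the hypervisor that
# is responsible, i.e. the parent of the level that trapped.

def deliver_trap(trap_level, trap):
    """Return the handling path for a trap raised at level trap_level."""
    path = ["trap from L%d lands at L0" % trap_level]
    responsible = trap_level - 1          # the hypervisor running this VM
    if responsible == 0:
        path.append("L0 handles %s itself" % trap)
    else:
        path.append("L0 forwards %s to L%d" % (trap, responsible))
        path.append("L%d emulates %s for L%d" % (responsible, trap, trap_level))
    return path

# A trap raised by L2 goes down to L0, which forwards it up to L1.
assert deliver_trap(2, "vmlaunch") == [
    "trap from L2 lands at L0",
    "L0 forwards vmlaunch to L1",
    "L1 emulates vmlaunch for L2",
]
# A trap raised by L1 is handled by L0 directly.
assert deliver_trap(1, "vmlaunch")[1] == "L0 handles vmlaunch itself"
```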
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles Project has four components that are crucial to its implementation:&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-Optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How does nested VMX virtualization work:&lt;br /&gt;
L0 (the lowest hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is handed to the CPU when the VM is launched. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. The vmlaunch instruction traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine: only L0 runs in the architectural hypervisor mode. To multiplex the hardware, effectively making L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; when L2 causes a trap, L0 either handles it itself or forwards it to L1, depending on whether it is the L1 virtual machine&#039;s responsibility to handle.&lt;br /&gt;
To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts. This would not normally be a problem, but because L1 is running in guest mode as a virtual machine, all of these privileged operations trap, so a single high-level L2 (or L3) exit causes many exits, and more exits mean less performance. This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously.&lt;br /&gt;
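The merge step can be sketched like this. It is a toy illustration only: a real VMCS is a hardware-defined structure accessed with vmread/vmwrite, not a dict, and the field names below are invented; the only idea taken from the paper is that the merged structure combines L2&#039;s guest state, L0&#039;s host state, and both levels&#039; intercepts.&lt;br /&gt;

```python
# Toy sketch of the VMCS merge (illustrative only: a real VMCS is a
# hardware-defined structure, not a dict, and these field names are
# invented).

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Build VMCS0->2 out of VMCS0->1 and VMCS1->2."""
    return {
        # Guest state: L2 runs with the state L1 specified for it.
        "guest_state": dict(vmcs_1_2["guest_state"]),
        # Host state: on an exit the CPU must return to L0, not L1,
        # so the host fields come from L0's own structure.
        "host_state": dict(vmcs_0_1["host_state"]),
        # Controls: intercept anything either L0 or L1 wants to intercept.
        "exit_on": set(vmcs_0_1["exit_on"]) | set(vmcs_1_2["exit_on"]),
    }

vmcs_0_1 = {"guest_state": {"rip": 0x1000}, "host_state": {"rip": 0xF000},
            "exit_on": {"hlt"}}
vmcs_1_2 = {"guest_state": {"rip": 0x2000}, "host_state": {"rip": 0x1000},
            "exit_on": {"cpuid"}}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
assert vmcs_0_2["guest_state"]["rip"] == 0x2000   # L2's own state
assert vmcs_0_2["host_state"]["rip"] == 0xF000    # exits return to L0
assert vmcs_0_2["exit_on"] == {"hlt", "cpuid"}
```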
&lt;br /&gt;
How does multi-dimensional paging work:&lt;br /&gt;
With n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU exposes only two page tables via EPT: virtual to guest physical and guest physical to host physical. The solution compresses the three translations onto the two available tables, going from start to end in two hops instead of three. One way to do this is shadow-on-EPT, a shadow page table for the guest combined with the EPT, which compresses the three logical translations into two tables. However, the guest page tables change frequently while the EPT tables rarely change, so with multi-dimensional paging L0 instead emulates EPT for L1 and uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This results in far fewer exits.&lt;br /&gt;
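The core of constructing EPT0-&amp;gt;2 from EPT0-&amp;gt;1 and EPT1-&amp;gt;2 is composing two translations into one. A toy sketch (illustrative; real EPT tables are radix trees over page frames, not Python dicts, and the addresses below are made up):&lt;br /&gt;

```python
# Toy sketch of the table compression behind multi-dimensional paging
# (illustrative; real EPT tables are radix trees over page frames, not
# dicts). L0 folds EPT0->1 and EPT1->2 into a single EPT0->2 so the
# hardware still walks only two tables.

def compose(ept_1_2, ept_0_1):
    """Compose L2phys->L1phys with L1phys->L0phys into L2phys->L0phys."""
    return {l2: ept_0_1[l1] for l2, l1 in ept_1_2.items() if l1 in ept_0_1}

ept_0_1 = {0x100: 0xA00, 0x200: 0xB00}   # L1 physical -> L0 physical
ept_1_2 = {0x10: 0x100, 0x20: 0x200}     # L2 physical -> L1 physical
ept_0_2 = compose(ept_1_2, ept_0_1)

# A single walk of ept_0_2 now replaces two nested walks.
assert ept_0_2 == {0x10: 0xA00, 0x20: 0xB00}
```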
&lt;br /&gt;
How does I/O virtualization work:&lt;br /&gt;
There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers, where the guest runs a driver that knows it is talking to a hypervisor (Barham03, Russell08), and direct device assignment (LeVasseur04, Yassour08), which gives the best performance. To get the best performance they used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. The idea with DMA is that each hypervisor (L0 and L1) needs to use an IOMMU to let its virtual machines safely access the device. There is only one place for an IOMMU in the hardware, so L0 emulates an IOMMU for L1, then compresses the multiple IOMMU translations into the single hardware IOMMU page table so that L2 can program the device directly, and the device DMAs into L2&#039;s memory space directly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
How did they implement the micro-optimizations to make it go faster:&lt;br /&gt;
The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transition between L1 and L2 and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. They optimized the transitions between L1 and L2; each such transition involves an exit to L0 and then an entry. In L0, most of the time is spent merging VMCSs, so they optimized the merge by copying data between VMCSs only when it had been modified, carefully balancing full copying against partial copying and tracking. The VMCS handling is optimized further by copying multiple VMCS fields at once: normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time, but VMCS data can be accessed without ill side-effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (though this might not work on processors other than the ones they tested). The main cause of the exit-handling slowdown is additional exits caused by privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specification, causing L1 to exit multiple times while it handles a single L2 exit. On AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
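The balance between full and partial copying can be sketched as follows. This is a toy illustration only: the field names and the 50% threshold are invented, not taken from the paper; the idea shown is just that tracking modified fields lets you copy less, with a fallback to one big copy when most fields changed.&lt;br /&gt;

```python
# Toy sketch of the "copy only what changed" VMCS optimization
# (illustrative; the field names and the 50% threshold are invented,
# not taken from the paper).

FULL_COPY_THRESHOLD = 0.5   # assumed tuning knob

def sync_vmcs(src, dst, dirty):
    """Copy src's dirty fields into dst; fall back to one big copy."""
    if len(dirty) > FULL_COPY_THRESHOLD * len(src):
        dst.update(src)              # full copy: one large memory copy
        copied = len(src)
    else:
        for field in dirty:          # partial copy: modified fields only
            dst[field] = src[field]
        copied = len(dirty)
    dirty.clear()
    return copied

src = {"rip": 1, "rsp": 2, "cr3": 3, "cr0": 4}
dst = {"rip": 0, "rsp": 0, "cr3": 0, "cr0": 0}
assert sync_vmcs(src, dst, {"rip"}) == 1      # only one field moved
assert dst == {"rip": 1, "rsp": 0, "cr3": 0, "cr0": 0}
assert sync_vmcs(src, dst, {"rip", "rsp", "cr3"}) == 4  # full-copy path
assert dst == src
```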
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
The overhead of nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb. &lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the exit-handling code running in a guest hypervisor such as L1 is much slower than the same code in L0.&lt;br /&gt;
&lt;br /&gt;
The paper outlines the optimization steps taken to minimize this overhead:&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and directly accessing VMCS data under certain conditions, removing the need to trap and emulate.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit-handling code (the main cause of the slowdown is the additional exits triggered by the exit-handling code).&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau best paper award. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: They do mention a master&#039;s thesis by Berghmans (citation 12 in the paper) that, if I understand it right, also covers software-only nested virtualization (they mention it in section 2 as well as in the video), but they claim it is inefficient because only the lowest-level hypervisor is able to take advantage of hardware virtualization support. In the Turtles Project solution, all levels of hypervisors can take advantage of any present virtualization support. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
*Writing/organization-wise: they provide links and resources that can help explain the concepts they only briefly touch upon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Lots of exits. To be continued. (Anyone who is interested, feel free to take this topic.)&lt;br /&gt;
&lt;br /&gt;
*Writing/organization-wise: some concepts, such as the VMCSs, are written as though you should already be familiar with how they work, or should read the appropriate references for that section of the research project&lt;br /&gt;
&lt;br /&gt;
* From quickly looking over their results section, it seems their tests are done at the L2 level, i.e., a guest with two hypervisors below it. To understand the limits of nesting, I think it would have been useful if they had run some tests at an even higher level of nesting (L4 or L5 or whatever), just to see what the effect is. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6400</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=6400"/>
		<updated>2010-12-02T16:46:58Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Research problem */  expanding research problem section. Added new paragraph&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +&lt;br /&gt;
* Michael D. Day ++&lt;br /&gt;
* Zvi Dubitzky +&lt;br /&gt;
* Michael Factor +&lt;br /&gt;
* Nadav Har’El +&lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program, or process to operate on. [1] Usually referred to as a virtual machine, this emulation typically consists of a guest hypervisor and a virtualized environment, giving the guest operating system the illusion that it&#039;s running on the bare hardware. In reality, the virtual machine runs as an application on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), the hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines, and to handle the issues that may arise from the interaction of those guests with one another and with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
The concept of recursively running one or more virtual machines inside one another. For instance, the host hypervisor (L0) runs a virtual machine L1; in turn, L1 runs another VM, L2; L2 then runs L3, and so on.&lt;br /&gt;
&lt;br /&gt;
[[File:VirtualizationDiagram-MH.png]]&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
A virtualization model that requires the guest OS kernel to be modified in order to have some direct access to the host hardware. In contrast to the full virtualization discussed at the beginning of the article, para-virtualization does not simulate the entire hardware; instead, it relies on a software interface that must be implemented in the guest so that it can have some privileged hardware access via special instructions called hypercalls. The advantage is fewer environment switches and less interaction between the guest and host hypervisors, and thus greater efficiency. However, portability is an obvious issue, since a system may be para-virtualized to be compatible with only one hypervisor. Another thing to note is that some operating systems, such as Windows, don&#039;t support para-virtualization.&lt;br /&gt;
&lt;br /&gt;
===Models of virtualization===&lt;br /&gt;
&lt;br /&gt;
=====Trap and emulate model=====&lt;br /&gt;
A model of virtualization based on the idea that when a guest hypervisor attempts to execute privileged instructions, such as creating its own virtual machine, it triggers a trap or fault that is caught by the host hypervisor. Based on the hardware model of virtualization support, the host hypervisor (L0) then determines whether it should handle the trap itself or forward it to the responsible parent of that guest hypervisor at a higher level.&lt;br /&gt;
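&lt;br /&gt;
The trap-and-emulate flow just described can be sketched in Python (an illustrative model only; run_guest and emulate are hypothetical names, not code from the paper):&lt;br /&gt;

```python
# Illustrative sketch of trap-and-emulate: unprivileged instructions run
# natively, while privileged ones fault and are emulated by the hypervisor.

def run_guest(instructions, emulate):
    """instructions: list of (opcode, privileged) pairs.
    emulate: the hypervisor handler invoked when a privileged opcode traps."""
    log = []
    for opcode, privileged in instructions:
        if privileged:
            # control transfers to the hypervisor, which emulates the opcode
            log.append(('trap', opcode, emulate(opcode)))
        else:
            # the instruction executes directly on the CPU
            log.append(('direct', opcode, None))
    return log
```

Unprivileged instructions cost nothing extra; every privileged one pays for a round trip through the hypervisor, which is what nesting later multiplies.&lt;br /&gt;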
&lt;br /&gt;
====Protection rings====&lt;br /&gt;
In modern operating systems, there are four levels of access privilege, called rings, numbered 0 to 3. Ring 0 is the most privileged level, allowing access to the bare hardware components. The operating system kernel must execute at Ring 0 in order to access the hardware and maintain control. User programs execute at Ring 3, while Ring 1 and Ring 2 are dedicated to device drivers and other operations.&lt;br /&gt;
&lt;br /&gt;
In virtualization, the host hypervisor executes at Ring 0, while the guest virtual machine executes at Ring 3 because it&#039;s treated as a running application. This is why, when a virtual machine attempts to gain hardware privileges or execute privileged instructions, a trap occurs and the hypervisor steps in to handle it.&lt;br /&gt;
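&lt;br /&gt;
As a toy model of this check (an assumption for illustration; real CPUs enforce rings in hardware), a lower ring number means more privilege:&lt;br /&gt;

```python
# Toy model of x86 protection rings: code at ring r may perform an
# operation requiring ring q only when r is numerically no higher than q.

RING_KERNEL, RING_USER = 0, 3

def attempt(current_ring, required_ring):
    if current_ring in range(required_ring + 1):  # current ring is privileged enough
        return 'executed'
    return 'trap to hypervisor'  # e.g. a guest VM at ring 3 touching ring-0 state
```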
&lt;br /&gt;
====Models of hardware support====&lt;br /&gt;
&lt;br /&gt;
=====Multiple-level architecture=====&lt;br /&gt;
In this model, every parent hypervisor handles the hypervisor running directly on top of it. For instance, assume that L0 (the host hypervisor) runs the VM L1. When L1 attempts to execute a privileged instruction and a trap occurs, the parent of L1, which is L0 in this case, will handle the trap. If L1 runs L2, and L2 attempts to execute privileged instructions as well, then L1 will act as the trap handler. More generally, every parent hypervisor at level Ln acts as the trap handler for its guest VM at level Ln+1. This model is not supported by the x86-based systems discussed in our research paper.&lt;br /&gt;
&lt;br /&gt;
=====Single-level architecture=====&lt;br /&gt;
This is the model supported by x86-based systems. Here, everything must go back to the main host hypervisor at the L0 level. For instance, if the host hypervisor (L0) runs L1, then when L1 attempts to run its own virtual machine L2, a trap is triggered that goes down to L0, and L0 sends the result of the requested instruction back to L1. In general, a trap at level Ln is handled by the host hypervisor at L0, and the resulting emulated instruction goes back to Ln.&lt;br /&gt;
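&lt;br /&gt;
The difference between the two hardware-support models can be captured in a small hypothetical helper (not from the paper): given a trap raised at nesting level n, which hypervisor level handles it?&lt;br /&gt;

```python
# Which hypervisor level handles a trap raised at nesting level n?
# 'multi'  : the parent hypervisor L(n-1) handles its own guest (not x86).
# 'single' : every trap funnels down to the host hypervisor at L0 (x86).

def trap_handler_level(trap_level, model):
    if model == 'multi':
        return trap_level - 1
    if model == 'single':
        return 0
    raise ValueError('unknown hardware-support model')
```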
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A user can run an application that&#039;s not compatible with the running OS inside a virtual machine. Operating systems can also provide the user with a compatibility mode for other operating systems or applications; an example is the Windows XP Mode available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. Both sides benefit: the provider can attract customers, and the customer has the freedom to deploy its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The best-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites can host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
We can also use nested virtualization for security purposes. One common example is virtual honeypots. A honeypot is basically a hollow program or network that appears functional to outside users but in reality exists only as a security tool to watch for or trap attacks. Using nested virtualization, we can run a honeypot copy of our system as virtual machines and see how the virtual system is attacked or which features are exploited. We can take advantage of the fact that such virtual honeypots can easily be controlled, manipulated, destroyed, or restored.&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of moving each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable. In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation, and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rough version. Let me know of any comments/improvements that can be made on the talk page&#039;&#039;&#039;--[[User:Mbingham|Mbingham]] 19:51, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s (see paper citations 21, 22, and 36). Early research in the area assumed hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions (see citation 12); however, these suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as the levels of virtualization increase, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so that one trap instruction multiplies into many trap instructions.&lt;br /&gt;
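&lt;br /&gt;
This multiplication can be modelled very roughly as follows (a back-of-the-envelope sketch; ops_per_exit is an assumed parameter, not a figure from the paper):&lt;br /&gt;

```python
# Rough model of exit multiplication in the single-level (x86) model:
# handling one exit at level n makes the level n-1 hypervisor execute
# several privileged instructions, each of which traps again one level down.

def exits_handled_by_l0(level, ops_per_exit):
    """Total traps L0 ultimately services for one exit at nesting `level`,
    assuming each hypervisor issues ops_per_exit privileged ops per exit."""
    if level == 1:
        return 1  # an L1 exit is handled by L0 directly
    return ops_per_exit * exits_handled_by_l0(level - 1, ops_per_exit)
```

The count grows multiplicatively with nesting depth, which is why a software-only approach has to focus on reducing the number of exits.&lt;br /&gt;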
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist (for example, on x86 processors). Solutions that do not require it suffer significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency: it solves the problem of a single nested trap expanding into many more trap instructions, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
More specifically, virtualization deals with how to share the resources of the computer between multiple guest operating systems. Nested virtualization must share these resources between multiple guest operating systems and guest hypervisors. The authors identify the CPU, memory, and I/O devices as the three key resources that need to be shared. Putting this together, the paper presents a solution to the problem of efficiently multiplexing the CPU, memory, and I/O between multiple virtual operating systems and hypervisors on a system that has no architectural support for nested virtualization.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The non-stop evolution of computers encourages intricate designs that are virtualized and well suited to cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, open the way for further development and ideas built on this infrastructure. For example, the paper &#039;&#039;Accountable Virtual Machines&#039;&#039; wraps programs around a VM in a particular state, which could certainly be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
==Theory==&lt;br /&gt;
&lt;br /&gt;
==CPU Virtualization==&lt;br /&gt;
==Memory virtualization==&lt;br /&gt;
==I/O virtualization==&lt;br /&gt;
==Macro optimizations==&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner and does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge, however, it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, attempts to illustrate the technique with an example using terms such as EPT and L1 without much introduction. All in all, the accompanying video greatly increased my awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line: the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The paper also won the Jay Lepreau Best Paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6385</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6385"/>
		<updated>2010-12-02T16:21:56Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Critique */ expanding on critique&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please, if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name or even email me, you can find my contact info in my profile page(just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtle Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add but don&#039;t edit or alter the already existing content. Lets try to get as many thoughts/ideas as possible and then we will edit and filter the redundancy later. And lets make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird, because in last Wednesday&#039;s lab the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to ozzy so I got free time now. I am reading it again to refresh my memory of it and will put notes of what I think we can criticize about it and such. What kind of references do you think we will need?  Similar papers etc?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca. And if he&#039;s still in our group but doesn&#039;t participate, too bad for him --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper. I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should get the background concepts done hopefully by tonight. If anyone wants to help with the other sections that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across what he thinks is a good or bad point, you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually the paper does talk about that. Look at the first two paragraphs of the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate; I think it&#039;s because it&#039;s not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided is excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need for and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here&#039;s my question: who&#039;s doing what in the group work, and where should I focus my attention to do my part? - Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need and advantages of NV, the models, the trap and emulate model of x86 machines, computer paging which is discussed in the paper, computer ring security which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far should we go here ? should we provide background on the hardware architecture used by the authors like the x86 family and the VMX chips, or maybe some of the concepts discussed later on in the testing such as optimization, emulation, para-virtualization ?&lt;br /&gt;
&lt;br /&gt;
I will speak and consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that. From what I understood, they apply the same model (trap and emulate) but provide optimizations and ways to increase the efficiency of trap calls between the nested environments, so that&#039;s definitely a contribution, but it&#039;s more of a performance-optimization kind of contribution I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper; the due date is now Dec 2nd.&#039;&#039;&#039; And that&#039;s really good, given that some of those concepts require time to formulate. I also asked the prof about the approach we should follow in terms of presenting the material, and he mentioned that you need to provide enough information in each section to make your fellow students understand what the paper is about without them having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details: if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in 2 small paragraphs or so.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah I am really thankful he left us with another week to do it.  I am sure we all have at least 3 projects due soon, other than this Essay.  I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow.  I was going to do it yesterday but he gave us an extension, so I slacked off a bit.  I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the background concepts here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration; as far as I know, we should be allowed to do this. If anyone has any suggestions about what I have posted, or any counter-arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive amount of exits can somehow be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design somehow too complex or difficult to get a hold of, I&#039;m not sure about this, but just guessing from a general programming point of view. I will email the prof today, maybe he can give us some hints for what can be considered a weakness or a bad spot if you will in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version, then move it to the actual article. As for the critique section, how about we put a section on the talk page here and people can add in what they thought worked/didn&#039;t work with some explanation/references, and then we can get someone/some people to combine it and put it in the essay?&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Yea really, great work on the Background. It&#039;s looking slick. I added some initial edit in the Contribution and Critique but I agree lets open a thread here and All collaborate. --[[User:Praubic|Praubic]] 18:24, 30 November 2010 (UTC)&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Nice man.  Sorry I haven&#039;t updated with anything that I have done yet, but I&#039;ll have  it up later today or tomorrow.  I got both an Essay and game dev project done for tomorrow, so after 1 I will be free to work on this until it is time for 3004--JSlonosky 13:41, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
I put up an initial version of the research problem section in the article. Let me know what you guys think. --[[User:Mbingham|Mbingham]] 19:53, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
Hey guys. Since I&#039;m working on the background concepts and Michael is handling the research problem, the other members should handle the contribution part. I think everything we need for the contribution section is in section 3 of the article (3.1, 3.2, 3.3, 3.4, 3.5). You can also make use of the things we posted here. Just to be on the safe side, we need to get this done by tomorrow night. I&#039;m working on a couple of definitions as we speak and will hopefully be done by tomorrow morning.&lt;br /&gt;
&lt;br /&gt;
PS: We should leave the critique to the end, there should not be a lot of writing for that part and we must all contribute.&lt;br /&gt;
&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 01:45, 1 December 2010 (UTC)&lt;br /&gt;
-----------------------------&lt;br /&gt;
Just posted the other bits that were missing in the background concepts section, like the security uses, models of virtualization, and para-virtualization. They&#039;re just a rough version, however; I will edit them in the next few hours. I just need to write something for protection rings and that would be it, I guess.&lt;br /&gt;
&lt;br /&gt;
I can help with the other sections for the rest of the day, I will try to post some summaries for performance and implementation or even the related work. --[[User:Hesperus|Hesperus]] 07:26, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Guys, we need to get moving here.. The contribution section still needs a lot. We need to talk about their innovations and the things they did there:&lt;br /&gt;
CPU virtualization, memory virtualization, I/O virtualization and the micro-optimizations.&lt;br /&gt;
&lt;br /&gt;
I will be posting something regarding this in the next few hours. --[[User:Hesperus|Hesperus]] 22:53, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
I have looked over the paper again and I am wondering about some things.  How are we to critique it?  By their methods, or by the paper itself?&lt;br /&gt;
I find that in the organization of the paper, they give you the links and extra information to look more in depth at things like the VMCS technology, but they almost use that as an excuse for not explaining things in the paper.&lt;br /&gt;
The VMCS(0-&amp;gt;1) notation isn&#039;t explained.  I understand what they mean, but it seems that they assume that you already know some things. --JSlonosky 03:03, 2 December 2010 (UTC)&lt;br /&gt;
-----------------&lt;br /&gt;
I think most research papers follow that kind of approach, they vaguely talk about the sideline things and provide references. The VMC technology from what I understood is just a creation of an environment to link or switch between hypervisors. --[[User:Hesperus|Hesperus]] 03:26, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
The instructions say that both style and content can be critiqued. I guess the organization of the paper would fall under style, but I&#039;m not sure how fair it is to critique how much they go in depth on certain things, especially some background stuff. After all, the audience of this paper is people who are already well versed in OS and virtualization topics. That&#039;s not to say that we shouldn&#039;t bring it up, especially if we feel they don&#039;t sufficiently explain a new technique or notation they are using. &lt;br /&gt;
&lt;br /&gt;
I think it&#039;s also important to remember that our critique will contain things they have done well, not just things they could have done better. Considering that this paper got the best paper award at the largest OS conference, I think it&#039;s safe to say our critique will have many more good things than bad.&lt;br /&gt;
&lt;br /&gt;
Here&#039;s some things they have done well on first inspection, just to get some ideas out there:&lt;br /&gt;
* Solution is extensible to an arbitrary nesting depth without major loss of performance&lt;br /&gt;
* Solution doesn&#039;t depend on modified hardware or software (except for the lowest level hypervisor); we can reference previous solutions that do require modifications&lt;br /&gt;
* The paper doesn&#039;t ignore virtualizing I/O devices to an arbitrary nesting depth, other techniques do&lt;br /&gt;
* I think the paper does well in laying out the theoretical approach to the problem, as well as demonstrating impressive empirical results.&lt;br /&gt;
&lt;br /&gt;
I&#039;ll have some time to work on this tomorrow, probably clean up the research problem section, maybe kick off the contribution section if no one&#039;s started it, and put up some more extensive stuff for the critique. Let me know what you guys think, i&#039;m off to bed pretty soon, haha! --[[User:Mbingham|Mbingham]] 03:41, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
Okay, thanks for clearing that up, man. Sounds good.  I&#039;ll see what else I can do in between the other work I&#039;ve got to do tonight.&lt;br /&gt;
One thing we should remember is to make sure that our essay clearly answers the question that is directed to it on the exam review.  If we get some other good ideas for questions, we should submit those to Anil as well.&lt;br /&gt;
Questions 1 and 2 relate to our essay, in my mind.&lt;br /&gt;
&amp;quot;What are two uses for nested virtual machines?&lt;br /&gt;
Multi-dimensional page tables are designed to avoid using shadow page tables in nested virtualization. What are shadow page tables, and when must they be used?&amp;quot;&lt;br /&gt;
--JSlonosky 04:47, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Hey guys. The points that Michael mentioned sound pretty great. I think the critique more or less depends on our understanding of the paper, so it&#039;s not like there&#039;s a specific answer or something.&lt;br /&gt;
I will also be seeing the prof tomorrow in his office hours if anyone wants to join me, I will post something here before I go.&lt;br /&gt;
&lt;br /&gt;
The background section is done. I will keep editing it and filtering some of the information. I don&#039;t have a lot to do today, so I will spend the whole day working on the paper, editing it and adding the references. I added some sub-sections for the contributions section. The theory part should just talk about the way they&#039;re flattening the levels of virtualization and multiplexing the hardware; I will try to write something for this. Then we go into the CPU, memory, I/O and optimization parts. And I can see that someone already handled those things here in the discussion, so we&#039;re pretty much done. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PS: Guys, please don&#039;t forget about the references. We don&#039;t wanna get into any trouble with the prof in that regard.&#039;&#039;&#039;&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 08:51, 2 December 2010 (UTC)&lt;br /&gt;
------------------&lt;br /&gt;
Alright, I will do some of the Contribution section today or tonight, so no worries. The critique: as I said, I added some stuff there, but we still need to debate the good and bad of the design as we perceive it. Since it&#039;s a critique, we can use the first person, &amp;quot;I&amp;quot; and &amp;quot;To me&amp;quot;. --[[User:Praubic|Praubic]] 15:37, 2 December 2010 (UTC)&lt;br /&gt;
-------------------&lt;br /&gt;
Also, if each of us could contribute to the Critique part (here in the discussion) in point form, then we can glue it together into concise sentences? We have to get straight to the point. We are not aiming for length, rather content, as you all know obviously. --[[User:Praubic|Praubic]] 15:53, 2 December 2010 (UTC)&lt;br /&gt;
--------------------&lt;br /&gt;
Actually, the contributions section is outlined below in the implementation here in the discussion page. So whoever did that should edit it and take it to the main page. I&#039;m going to the office hours 2 hours from now to ask the prof a couple of things, including the critique. --[[User:Hesperus|Hesperus]] 15:58, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation, which includes a guest hypervisor and a virtualized environment, gives the guest the illusion that it is running directly on the real hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
A hypervisor, also referred to as a VMM (virtual machine monitor), is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guest virtual machines with one another, and from their interaction with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the bare-metal hypervisor (L0) runs a VM called L1; in turn, L1 runs another VM, L2; L2 then runs L3, and so on.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute a privileged instruction or access privileged hardware state, it triggers a trap or a fault which is caught by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute. Based on that, the host hypervisor provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
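The trap-and-emulate flow above can be pictured as a tiny dispatch loop. The sketch below is purely illustrative: the operation names and the allow/deny sets are made up, not from any real hypervisor.&lt;br /&gt;

```python
# Minimal sketch of trap-and-emulate: privileged guest operations trap
# to the host hypervisor, which validates them and emulates the result.
# All names here are illustrative, not from any real hypervisor API.

ALLOWED = {"read_cr3", "cpuid"}    # operations the host will emulate
DENIED = {"write_msr_secret"}      # operations the host refuses

def guest_execute(op):
    """Guest attempts a privileged operation; hardware raises a trap."""
    return host_handle_trap(op)

def host_handle_trap(op):
    """Host hypervisor catches the trap and decides allow/deny."""
    if op in ALLOWED:
        return f"emulated:{op}"    # emulate the requested outcome
    return f"fault:{op}"           # inject a fault into the guest

print(guest_execute("cpuid"))             # emulated:cpuid
print(guest_execute("write_msr_secret"))  # fault:write_msr_secret
```

The key point the sketch shows is that the guest never touches the hardware state directly; every privileged action is routed through the host&#039;s handler.&lt;br /&gt;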
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this is&lt;br /&gt;
the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines in cases of upgrade or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if corrupted or damaged it can easily be removed, recreated or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they&#039;re not really altering the underlying architecture, and this is basically the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, yet they were able to achieve it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the quickly developing cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A cloud user manages their own virtual machine directly through a hypervisor of choice. In addition, it provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper where he talks about more recent research work on virtualization, particularly in his first paragraph, he refers to some more recent research by the VMWare technical support team. He also talks about some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
There are two models for implementing nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles every other hypervisor running on top of it. For instance, suppose L0 (the host hypervisor) runs L1. If L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 will handle the trap handling, and so on.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: This is the model supported by x86 machines. This model is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX capability in the paper&#039;s implementation) and presents a fake platform for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think that it&#039;s running on the actual hardware. The idea is that in order for a guest hypervisor to operate and gain hardware-level privileges, it triggers a fault or a trap; this trap is then caught by the main host hypervisor and inspected to see whether it&#039;s a legitimate or appropriate request. If it is, the host provides the privileged behaviour to the guest, again having it think that it&#039;s actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, everything must go back to the main host hypervisor, which then forwards the trap and virtualization state to the level responsible for handling it. For instance, suppose L0 runs L1, and L1 attempts to run L2: the command to run L2 goes down to L0, and L0 then forwards it back up to L1. This is the model we&#039;re interested in, because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
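The single-level forwarding rule above (every exit lands in L0 first, and L0 forwards it to the hypervisor responsible for the trapping level) can be sketched as a toy model. All names here are hypothetical:&lt;br /&gt;

```python
# Sketch of single-level (x86-style) nested trap handling: every exit
# lands in L0 first, and L0 forwards it to the hypervisor directly
# below the level that trapped. Illustrative model, not real code.

def handle_exit(trapping_level, handlers):
    """All exits reach L0; L0 forwards to level (trapping_level - 1).

    handlers: dict mapping hypervisor level -> handler function.
    Returns the list of levels the exit visited, in order.
    """
    responsible = trapping_level - 1   # the hypervisor that runs this VM
    path = [0]                         # hardware always exits to L0 first
    if responsible != 0:
        path.append(responsible)       # L0 forwards the exit upward
    handlers[responsible](trapping_level)
    return path

log = []
handlers = {n: (lambda lvl, n=n: log.append((n, lvl))) for n in range(3)}
print(handle_exit(2, handlers))  # [0, 1]: L2's exit reaches L0, forwarded to L1
print(handle_exit(1, handlers))  # [0]: L1's exit is handled by L0 itself
```

Note how L2&#039;s exit travels through L0 even though L1 is the hypervisor that must ultimately handle it; this round trip is exactly the overhead the paper&#039;s optimizations target.&lt;br /&gt;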
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles project has four components that are crucial to its implementation:&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-Optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How does nested VMX virtualization work:&lt;br /&gt;
L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine and the architecture supports only a single level of hypervisor. So, to multiplex the hardware and make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 now launches L2; when L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the L1 virtual machine&#039;s responsibility to handle it.&lt;br /&gt;
To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts. These are privileged operations that wouldn&#039;t normally be a problem, but because L1 is running in guest mode as a virtual machine, every such operation traps, so a single high-level L2 exit (or L3 exit) causes many exits (and more exits means less performance). This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2. This process repeats continuously.&lt;br /&gt;
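A minimal sketch of the VMCS merge idea, using a dict with made-up field names (the real VMCS layout is far richer; this only shows that L2&#039;s guest state comes from VMCS1-&amp;gt;2 while the host state must return control to L0):&lt;br /&gt;

```python
# Illustrative sketch of the VMCS "merge": to run L2 directly, L0 builds
# VMCS(0->2) from VMCS(1->2) (the guest state L1 specified for L2) and
# VMCS(0->1) (host fields that return control to L0 on an exit).
# Field names are made up for illustration.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    merged = {}
    # Guest state of L2 comes from what L1 prepared.
    merged.update({k: v for k, v in vmcs_1_2.items() if k.startswith("guest_")})
    # Host state must return control to L0, so take VMCS(0->1)'s host fields.
    merged.update({k: v for k, v in vmcs_0_1.items() if k.startswith("host_")})
    return merged

vmcs_0_1 = {"guest_rip": "L1_entry", "host_rip": "L0_exit_handler"}
vmcs_1_2 = {"guest_rip": "L2_entry", "host_rip": "L1_exit_handler"}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2)  # {'guest_rip': 'L2_entry', 'host_rip': 'L0_exit_handler'}
```

The merged structure runs L2&#039;s code but always exits back to L0, never to L1, which is why L0 must then decide whether to forward each exit.&lt;br /&gt;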
&lt;br /&gt;
How does multi-dimensional paging work:&lt;br /&gt;
The main idea: with n = 2 nested virtualization there are three logical translations: from L2 virtual to L2 physical addresses, from L2 physical to L1 physical, and from L1 physical to L0 physical. That is three levels of translation, but the hardware MMU supports only two page tables, via what is called EPT: one taking virtual to guest-physical addresses and one taking guest-physical to host-physical. So the three translations are compressed onto the two available tables, going from start to end in two hops instead of three. Without EPT this is done with a shadow page table for the virtual machine; with EPT, shadow-on-EPT compresses the three logical translations into two. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
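The table compression can be sketched as composing two mappings. The page numbers below are toy values chosen for illustration, not from the paper:&lt;br /&gt;

```python
# Sketch of multi-dimensional paging's table compression: L0 composes
# EPT(1->2) (L2-physical -> L1-physical, maintained by L1) with
# EPT(0->1) (L1-physical -> L0-physical) to build EPT(0->2), so the
# hardware resolves L2-physical -> L0-physical in one hop.
# Page frame numbers are toy values for illustration.

def compose(ept_1_2, ept_0_1):
    """Return EPT(0->2): L2-physical page -> L0-physical (machine) page."""
    return {l2_pfn: ept_0_1[l1_pfn] for l2_pfn, l1_pfn in ept_1_2.items()}

ept_1_2 = {0: 7, 1: 3}        # L2 page -> L1 page
ept_0_1 = {7: 42, 3: 10}      # L1 page -> L0 (machine) page
ept_0_2 = compose(ept_1_2, ept_0_1)
print(ept_0_2)  # {0: 42, 1: 10}
```

Because the composed table is what the hardware walks, an L2 memory access no longer needs an exit per intermediate translation; L0 only rebuilds entries when L1 changes EPT1-&amp;gt;2.&lt;br /&gt;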
&lt;br /&gt;
How does I/O virtualization work:&lt;br /&gt;
There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman 01), para-virtualized drivers that are aware they are running in a VM (Barham 03, Russell 08), and direct device assignment (LeVasseur 04, Yassour 08), which gives the best performance. To get the best performance, they used an IOMMU for safe DMA bypass. With nesting there are 3x3 options for I/O virtualization; of the many options, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. The idea with DMA is that each hypervisor (L0 and L1) needs to use an IOMMU to allow its virtual machines to access the device safely, but there is only one platform IOMMU, so L0 emulates an IOMMU for L1. L0 then compresses the multiple IOMMU tables into the single hardware IOMMU page table, so that L2 programs the device directly and the device&#039;s DMAs go into L2&#039;s memory space directly.&lt;br /&gt;
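The IOMMU compression follows the same composition idea as the page tables. A sketch with made-up page numbers (illustrative only):&lt;br /&gt;

```python
# Sketch of multi-level device assignment's IOMMU compression: L0 folds
# L1's (emulated) IOMMU table into its own, producing the one table the
# hardware IOMMU actually uses, so device DMA lands in L2's memory with
# no exit to L0 or L1 on the I/O path. Toy page numbers, illustrative.

def build_hw_iommu(iommu_1_2, iommu_0_1):
    """Compress L1's IOMMU (L2-phys -> L1-phys) with L0's table
    (L1-phys -> machine) into one hardware-usable mapping."""
    return {l2: iommu_0_1[l1] for l2, l1 in iommu_1_2.items()}

iommu_1_2 = {0: 5}     # L1 maps L2's DMA buffer page 0 -> L1 page 5
iommu_0_1 = {5: 99}    # L0 maps L1 page 5 -> machine page 99
hw_iommu = build_hw_iommu(iommu_1_2, iommu_0_1)
# The device DMAs to L2-physical page 0; the hardware IOMMU redirects it
# to machine page 99, which holds L2's buffer.
print(hw_iommu[0])  # 99
```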
&lt;br /&gt;
&lt;br /&gt;
How they implement the Micro-Optimizations to make it go faster:&lt;br /&gt;
The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2, and the exit-handling code running in the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were made in L0 only. They optimized the transitions between L1 and L2: each such transition involves an exit to L0 and then an entry. In L0, most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only when it has been modified, carefully balancing full copying against partial copying and tracking. VMCS handling is optimized further by copying multiple VMCS fields at once. Normally, by Intel&#039;s specification, VMCS reads and writes must be performed using the vmread and vmwrite instructions, which operate on a single field at a time. VMCS data can, however, be accessed without ill side effects by bypassing&lt;br /&gt;
vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested). The main cause of the exit-handling slowdown is the additional exits caused by&lt;br /&gt;
privileged instructions in the exit-handling code: vmread and vmwrite are used by the hypervisor to change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not need to intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
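A toy model of why the bulk copy helps: if every emulated vmread/vmwrite costs one exit, copying N fields one at a time costs N exits, while one large memory copy costs none. Field names and counts below are illustrative, not from the paper:&lt;br /&gt;

```python
# Toy cost model for the VMCS copy optimization. Each per-field access
# through (emulated) vmread/vmwrite traps to L0, so we count one exit
# per field; a bulk memory copy of the VMCS region traps zero times.
# Field names are made up for illustration.

FIELDS = ["guest_rip", "guest_rsp", "guest_cr3", "exit_reason"]

def copy_per_field(src, dst):
    """Per-field copy via (emulated) vmread/vmwrite: one exit per field."""
    exits = 0
    for f in FIELDS:
        dst[f] = src[f]   # each access would trap to L0
        exits += 1
    return exits

def copy_bulk(src, dst):
    """One large memory copy of the whole region: no trapping accesses."""
    dst.update({f: src[f] for f in FIELDS})
    return 0

src = {f: i for i, f in enumerate(FIELDS)}
print(copy_per_field(src, {}))  # 4 exits
print(copy_bulk(src, {}))       # 0 exits
```

Real VMCSs have dozens of fields, so the per-field cost scales accordingly, which is why the paper reports this as one of its most effective micro-optimizations.&lt;br /&gt;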
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
The overhead for nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb. &lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the code handling exits running in a nested hypervisor such as L1 is much slower than the same code in L0.&lt;br /&gt;
&lt;br /&gt;
The paper outlines optimization steps to achieve the minimal overhead.&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and directly accessing the data under certain conditions, removing the need to trap and emulate.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit handling code (the main cause of the slowdown is the additional exits in the exit handling code).&lt;br /&gt;
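For reference, the overhead percentages quoted above are relative slowdowns versus the baseline. The run times below are made-up numbers chosen only to reproduce the quoted figures, not measurements from the paper:&lt;br /&gt;

```python
# Overhead as quoted above is relative slowdown versus the baseline:
# overhead = (t_nested - t_baseline) / t_baseline.
# The run times passed in below are invented to match the quoted
# percentages; they are not data from the paper.

def overhead_pct(t_baseline, t_nested):
    """Relative slowdown of a nested run versus the baseline, in percent."""
    return round(100 * (t_nested - t_baseline) / t_baseline, 1)

print(overhead_pct(100.0, 110.3))  # 10.3 (shape of the kernbench figure)
print(overhead_pct(100.0, 106.3))  # 6.3  (shape of the SPECjbb figure)
```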
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. They also won the Jay Lepreau best paper award. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: They do mention a master&#039;s thesis by Berghmans (citation 12 in the paper) that, if I understand it right, also covers software-only nested virtualization (they mention it in section 2 as well as in the video), but they claim it is inefficient because only the lowest level hypervisor is able to take advantage of hardware with virtualization support. In the Turtles project solution, all levels of hypervisor can take advantage of any present virtualization support. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
*Writing, organization wise: They provide links and resources that can help give explanations to the concepts that they briefly touch upon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* lots of exits. to be continued. (anyone who is interested, feel free to take this topic)&lt;br /&gt;
&lt;br /&gt;
*Writing, organization wise:  Some concepts, such as the VMCSs, are written assuming that you are already familiar with how they work, or have read the appropriate references for that section of the research project&lt;br /&gt;
&lt;br /&gt;
* From quickly looking over their results section, it seems their tests are done at the L2 level, a guest with two hypervisors below it. I think it might have been useful to understand the limits of nesting if they did some tests at an even higher level of nesting, L4 or L5 or whatever, just to see what the effect is. --[[User:Mbingham|Mbingham]] 16:21, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6148</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=6148"/>
		<updated>2010-12-02T03:41:58Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* General discussion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please,&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info in my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something? We haven&#039;t confirmed or emailed the prof yet; I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic).&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name, or even email me; you can find my contact info in my profile page (just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (the Turtles Project) to be quite interesting and approachable. In fact, we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMware and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
covers, like nested virtualization, hypervisors, supervisors, etc., things that we even covered in class and can in fact test on our machines. I&#039;ve already started reading the article; hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session on the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here... the paper is due in 4 days. I just did the paper intro section (providing the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure about how we should divide the work and sections among the members, especially regarding the research contribution and critique. I mean, those sections should not be written from the perspective of one person; we all need to work on and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add it, but don&#039;t edit or alter the already existing content. Let&#039;s try to get as many thoughts/ideas as possible, and then we will edit and filter out the redundancy later. And let&#039;s make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird because at last Wednesday&#039;s lab, the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. I&#039;m not going to Ozzy anymore, so I&#039;ve got free time now. I am reading it again to refresh my memory of it and will put up notes on what I think we can criticize about it and such. What kind of references do you think we will need? Similar papers, etc.?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca. And if he is still in our group but doesn&#039;t participate, too bad for him --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper. I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should hopefully get the background concepts done by tonight. If anyone wants to help with the other sections, that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah, absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across what they think is a good or bad point, they can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can maybe have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly explain why nested virtualization is necessary. I posted a link in references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually, the paper does talk about that. Look at the first two paragraphs in the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate; I think it&#039;s because it&#039;s not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided is excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need for and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here&#039;s my question: who&#039;s doing what in the group work, and where should I focus my attention to do my part? - Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted a few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need for and advantages of NV, the models, the trap and emulate model of x86 machines, memory paging, which is discussed in the paper, and computer ring security, which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far we should go here. Should we provide background on the hardware architecture used by the authors, like the x86 family and the VMX extensions, or maybe some of the concepts discussed later on in the testing, such as optimization, emulation and para-virtualization?&lt;br /&gt;
&lt;br /&gt;
I will consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that. From what I understood, they apply the same model (trap and emulate), but they provide optimizations and ways to increase the efficiency of the trap handling between the nested environments, so that&#039;s definitely a contribution. But it&#039;s more of a performance-optimization kind of contribution, I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper; the due date is now Dec 2nd.&#039;&#039;&#039; And that&#039;s really good, given that some of those concepts require time to formulate. I also asked the prof about the approach that we should follow in terms of presenting the material, and he mentioned that we need to provide enough information in each section to make our fellow students understand what the paper is about without them having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details: if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in 2 small paragraphs or so.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah, I am really thankful he left us another week to do it. I am sure we all have at least 3 projects due soon, other than this essay. I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow. I was going to do it yesterday, but he gave us an extension, so I slacked off a bit. I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the background concepts here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration; as far as I know, we should be allowed to do this. If anyone has any suggestions about what I have posted, or any counter-arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive number of exits can somehow be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design somewhat too complex or difficult to get a hold of. I&#039;m not sure about this, just guessing from a general programming point of view. I will email the prof today; maybe he can give us some hints for what can be considered a weakness or a bad spot, if you will, in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version, then move it to the actual article. As for the critique section, how about we put a section on the talk page here where people can add in what they thought worked/didn&#039;t work with some explanation/references, and then we can get someone/some people to combine it and put it in the essay? &lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Yeah really, great work on the background. It&#039;s looking slick. I added some initial edits in the contribution and critique, but I agree, let&#039;s open a thread here and all collaborate. --[[User:Praubic|Praubic]] 18:24, 30 November 2010 (UTC)&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Nice man. Sorry I haven&#039;t updated with anything that I have done yet, but I&#039;ll have it up later today or tomorrow. I&#039;ve got both an essay and a game dev project due tomorrow, so after 1 I will be free to work on this until it is time for 3004 --JSlonosky 13:41, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
I put up an initial version of the research problem section in the article. Let me know what you guys think. --[[User:Mbingham|Mbingham]] 19:53, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
Hey guys. Since I&#039;m working on the background concepts and Michael is handling the research problem, the other members should handle the contribution part. I think everything we need for the contribution section is in section 3 of the article (3.1, 3.2, 3.3, 3.4, 3.5). You can also make use of the things we posted here. Just to be on the safe side, we need to get this done by tomorrow night. I&#039;m working on a couple of definitions as we speak and will hopefully be done by tomorrow morning.&lt;br /&gt;
&lt;br /&gt;
PS: We should leave the critique to the end; there should not be a lot of writing for that part, and we must all contribute.&lt;br /&gt;
&lt;br /&gt;
--[[User:Hesperus|Hesperus]] 01:45, 1 December 2010 (UTC)&lt;br /&gt;
-----------------------------&lt;br /&gt;
Just posted other bits that were missing in the background concepts section, like the security uses, the models of virtualization and para-virtualization. They&#039;re just a rough version, however; I will edit them in the next few hours. I just need to write something for protection rings and that would be it, I guess.&lt;br /&gt;
&lt;br /&gt;
I can help with the other sections for the rest of the day, I will try to post some summaries for performance and implementation or even the related work. --[[User:Hesperus|Hesperus]] 07:26, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
Guys, we need to get moving here... The contribution section still needs a lot. We need to talk about their innovations and the things they did there:&lt;br /&gt;
CPU virtualization, memory virtualization, I/O virtualization and the micro-optimizations.&lt;br /&gt;
&lt;br /&gt;
I will be posting something regarding this in the next few hours. --[[User:Hesperus|Hesperus]] 22:53, 1 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
I have looked over the paper again and I am wondering about some things. How are we to critique it? By their methods, or by the paper itself?&lt;br /&gt;
I find that in the organization of the paper, they give you the links and extra information to look more in depth at things like the VMCS technology, but they almost use that as an excuse for not explaining things in the paper.&lt;br /&gt;
The VMCS(0 -&amp;gt;1) notation isn&#039;t explained. I understand what they mean, but it seems that they assume that you already know some things. --JSlonosky 03:03, 2 December 2010 (UTC)&lt;br /&gt;
-----------------&lt;br /&gt;
I think most research papers follow that kind of approach; they talk only vaguely about the sideline things and provide references. The VMCS technology, from what I understood, is just the creation of an environment to link or switch between hypervisors. --[[User:Hesperus|Hesperus]] 03:26, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
The instructions say that both style and content can be critiqued. I guess the organization of the paper would fall under style, but I&#039;m not sure how fair it is to critique how much they go in depth on certain things, especially some background stuff. After all, the audience of this paper is people who are already well versed in OS and virtualization stuff. That&#039;s not to say that we shouldn&#039;t bring it up, especially if we feel they don&#039;t sufficiently explain a new technique or notation they are using. &lt;br /&gt;
&lt;br /&gt;
I think it&#039;s also important to remember that our critique will contain things they have done well, not just things they could have done better. Considering that this paper got the best paper award at the largest OS conference, I think it&#039;s safe to say our critique will have many more good things than bad.&lt;br /&gt;
&lt;br /&gt;
Here are some things they have done well on first inspection, just to get some ideas out there:&lt;br /&gt;
* Solution is extensible to an arbitrary nesting depth without major loss of performance&lt;br /&gt;
* Solution doesn&#039;t depend on modified hardware or software (except for the lowest-level hypervisor); we can reference previous solutions that do require modifications&lt;br /&gt;
* The paper doesn&#039;t ignore virtualizing I/O devices to an arbitrary nesting depth, other techniques do&lt;br /&gt;
* I think the paper does well in laying out the theoretical approach to the problem, as well as demonstrating impressive empirical results.&lt;br /&gt;
&lt;br /&gt;
I&#039;ll have some time to work on this tomorrow; probably clean up the research problem section, maybe kick off the contribution section if no one&#039;s started it, and put up some more extensive stuff for the critique. Let me know what you guys think, I&#039;m off to bed pretty soon, haha! --[[User:Mbingham|Mbingham]] 03:41, 2 December 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----------------------------&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] This emulation, usually referred to as a virtual machine, consists of a virtualized environment (possibly including its own guest hypervisor) and merely gives the guest virtual machine the illusion that it is running directly on the main hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used, such as data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as the VMM (virtual machine monitor), the hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines, and to take care of the issues that may arise from the interaction of those guest virtual machines with one another and with the host hardware and operating system. It also controls the host&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the host hypervisor (L0) runs a VM called L1; in turn, L1 runs another VM, L2; L2 then runs L3, and so on.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute a privileged instruction or access privileged hardware state, it triggers a trap or fault which is caught and handled by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute or not and, based on that, provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
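As a rough illustration of the trap-and-emulate flow, here is a toy Python sketch (the instruction names and the policy table are invented for illustration; a real hypervisor decodes and emulates actual hardware instructions):&lt;br /&gt;

```python
# Toy trap-and-emulate sketch. The instruction names and the policy
# are made up for illustration; they are not from the paper.

class HostHypervisor:
    """Catches traps from the guest and emulates privileged work."""

    def __init__(self):
        self.emulated = {
            "cpuid": "emulated cpuid result",
            "read_cr3": "emulated page-table root",
        }

    def handle_trap(self, instruction):
        # Decide whether the trapped instruction is legitimate,
        # then emulate its outcome instead of running it natively.
        if instruction in self.emulated:
            return self.emulated[instruction]
        raise PermissionError("refused: " + instruction)


class Guest:
    """A guest whose privileged instructions always trap to the host."""

    def __init__(self, host):
        self.host = host

    def execute_privileged(self, instruction):
        # On real hardware the CPU raises the trap; here we model it
        # as a direct call into the host hypervisor.
        return self.host.handle_trap(instruction)


guest = Guest(HostHypervisor())
print(guest.execute_privileged("cpuid"))  # prints the emulated result
```

The point is only the control flow: the guest never touches the hardware directly, and every privileged operation detours through the host.&lt;br /&gt;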
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this would&lt;br /&gt;
be the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customers have the freedom to implement their systems on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for the live migration or transfer of virtual machines in cases of upgrades or disaster&lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if corrupted or damaged, it can easily be removed, recreated or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concepts stuff, so Munther, feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they&#039;re not really altering the underlying architecture, and this is basically the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, but the authors were able to do it anyway.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the quickly developing field of cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A cloud user manages their own virtual machine directly through a hypervisor of their choice. In addition, nesting provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper where he talks about more recent research work on virtualization, particularly in his first paragraph, he refers to some more recent research by the VMWare technical support team. He also talks about some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
Apparently, there are two models for applying nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the other hypervisors running on top of it. For instance, if L0 (the host hypervisor) runs L1 and L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 handles the trapping and so on.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: This is the model supported by x86 machines. It is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX extensions in the paper&#039;s implementation) and presents a faked environment for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think that it is running on the actual hardware. The idea here is that when a guest hypervisor tries to perform a privileged operation, it triggers a fault or trap; this trap or fault is then caught by the main host hypervisor and inspected to see whether it is a legitimate or appropriate request. If it is, the host emulates the privileged operation for the guest, again having it think that it is actually running on the main bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, everything must go back to the main host hypervisor, L0, which then forwards the trap to the hypervisor responsible for handling it. For instance, if L0 runs L1 and L1 attempts to run L2, the command to run L2 goes down to L0, and L0 then forwards it back up to L1. This is the model we&#039;re interested in, because it is basically what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
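A minimal Python sketch of this single-level routing, assuming an invented trap name and responsibility table (the real decision in the paper is made from the VMCS contents):&lt;br /&gt;

```python
# Sketch of single-level trap routing: hardware always delivers a
# trap to L0, and L0 software forwards it to the responsible
# hypervisor. Trap names and the ownership argument are invented.

def l0_dispatch(trap, responsible_level):
    # Step 1: hardware delivers every trap to L0, regardless of
    # which nesting level caused it.
    log = ["hardware delivered %s to L0" % trap]
    # Step 2: L0 decides whether to handle it or forward it upward.
    if responsible_level == "L0":
        log.append("L0 handled " + trap)
    else:
        log.append("L0 forwarded %s to %s" % (trap, responsible_level))
    return log

# An exit caused by L2 that is really L1's business still lands at
# L0 first, then gets forwarded.
for line in l0_dispatch("l2_page_fault", "L1"):
    print(line)
```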
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles project has four components that are crucial to its implementation.&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-Optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How does the nested VMX virtualization work?&lt;br /&gt;
L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (virtual machine control structure). The VMCS is the fundamental data structure that a hypervisor prepares to describe a virtual machine; it is passed to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. That vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine: the architecture supports only a single hypervisor level. So, in order to multiplex the hardware and make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2 (enabling L0 to run L2 directly). L0 then launches L2; when L2 causes a trap, L0 either handles the trap itself or forwards it to L1, depending on whether it is the responsibility of L1&#039;s virtual machine to handle it.&lt;br /&gt;
Regarding the way a single L2 exit is handled: L1 needs to read and write the VMCS and disable interrupts, which wouldn&#039;t normally be a problem, but because L1 is running in guest mode as a virtual machine, all of those operations trap, so a single high-level L2 exit (or L3 exit) causes many exits (more exits, less performance). This problem was addressed by making the single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2. This process is repeated over and over.&lt;br /&gt;
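As a very rough sketch of the merge idea, assuming invented field names and a simplified merge rule (the real VMCS merge in the paper is far more involved):&lt;br /&gt;

```python
# Invented sketch of merging the VMCS that L0 uses to run L1 with the
# VMCS that L1 prepared for L2, producing the VMCS that lets L0 run
# L2 directly. Field names and the merge rule are made up.

def merge_vmcs(vmcs_l0_runs_l1, vmcs_l1_runs_l2):
    # Guest state comes from the VMCS L1 prepared: it describes L2.
    merged = dict(vmcs_l1_runs_l2)
    # Host state must keep pointing at L0, since real exits from L2
    # always land at L0 on this single-level architecture.
    for field, value in vmcs_l0_runs_l1.items():
        if field.startswith("host_"):
            merged[field] = value
    return merged

vmcs_l0_runs_l1 = {"host_rip": "l0_exit_handler", "guest_rip": "l1_code"}
vmcs_l1_runs_l2 = {"host_rip": "l1_exit_handler", "guest_rip": "l2_code"}

vmcs_l0_runs_l2 = merge_vmcs(vmcs_l0_runs_l1, vmcs_l1_runs_l2)
print(vmcs_l0_runs_l2["guest_rip"])  # l2_code: the CPU runs L2 ...
print(vmcs_l0_runs_l2["host_rip"])   # l0_exit_handler: ... exits land at L0
```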
&lt;br /&gt;
How does multi-dimensional paging work?&lt;br /&gt;
The main idea: with n = 2 nested virtualization there are three logical translations: from an L2 virtual address to an L2 physical address, from an L2 physical address to an L1 physical address, and from an L1 physical address to an L0 physical address. That is three levels of translation, but the hardware MMU provides only two page tables (using EPT): one from virtual to guest-physical addresses, and one from guest-physical to host-physical addresses. So the three translations are compressed onto the two hardware tables, going from start to end in two hops instead of three. This is done with a shadow page table for the virtual machine and with shadow-on-EPT, which compresses the three logical translations onto two page tables. The EPT tables rarely change, whereas the guest page tables change frequently. L0 emulates EPT for L1, and it uses EPT0-&amp;gt;1 and EPT1-&amp;gt;2 to construct EPT0-&amp;gt;2. This process results in fewer exits.&lt;br /&gt;
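The composition of the two EPTs can be sketched as follows (the addresses are made up; real EPTs are multi-level hardware page tables, not flat dictionaries):&lt;br /&gt;

```python
# Illustrative sketch: L0 composes the EPT that L1 maintains for L2
# with the EPT that L0 maintains for L1, yielding one table that maps
# L2-physical addresses straight to L0-physical addresses. All
# addresses are made up.

def compose_translations(inner, outer):
    # Follow each mapping through both tables so the hardware only
    # needs one lookup instead of two.
    return {addr: outer[mid] for addr, mid in inner.items()}

# L2-physical to L1-physical (maintained by L1 for its guest L2) ...
ept_1_2 = {0x1000: 0x8000, 0x2000: 0x9000}
# ... and L1-physical to L0-physical (maintained by L0 for L1).
ept_0_1 = {0x8000: 0x40000, 0x9000: 0x50000}

# The composed table that L0 installs in the hardware MMU.
ept_0_2 = compose_translations(ept_1_2, ept_0_1)
print(hex(ept_0_2[0x1000]))  # 0x40000
```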
&lt;br /&gt;
How does I/O virtualization work?&lt;br /&gt;
There are three fundamental ways for a virtual machine to access I/O: device emulation (Sugerman01), para-virtualized drivers that know they are running on a hypervisor (Barham03, Russell08), and direct device assignment (Levasseur04, Yassour08), which gives the best performance. To get the best performance, they used an IOMMU for safe DMA bypass. With nesting, there are 3x3 options for I/O virtualization; of these, they used multi-level device assignment, giving the L2 guest direct access to L0&#039;s devices, bypassing both L0 and L1. To do this they had to handle memory-mapped I/O, programmed I/O, DMA and interrupts. The idea with DMA is that each hypervisor (L0 and L1) needs to use an IOMMU to safely allow its virtual machine to access the device directly, but there is only one hardware IOMMU on the platform, so L0 emulates an IOMMU for L1. L0 then compresses the multiple IOMMU page tables into the single hardware IOMMU page table, so that L2 can program the device directly and the device&#039;s DMAs go straight into L2&#039;s memory space.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
How did they implement the micro-optimizations to make it go faster?&lt;br /&gt;
The two main places where a guest of a nested hypervisor is slower than the same guest running on a bare-metal hypervisor are the transitions between L1 and L2 and the exit-handling code running on the L1 hypervisor. Since L1 and L2 are assumed to be unmodified, the required changes were confined to L0. They optimized the transitions between L1 and L2: each such transition involves an exit to L0 and then an entry. In L0, most of the time is spent merging VMCSs, so they optimize this by copying data between VMCSs only if it has been modified, carefully balancing full copying versus partial copying with tracking. The VMCS handling is optimized further by copying multiple VMCS fields at once: normally, by Intel&#039;s specification, reads and writes must be performed using the vmread and vmwrite instructions (which operate on a single field), but the VMCS data can be accessed without ill side effects by bypassing vmread and vmwrite and copying multiple fields at once with large memory copies (this might not work on processors other than the ones they tested).&lt;br /&gt;
The main cause of the exit-handling slowdown is the additional exits caused by privileged instructions in the exit-handling code itself: vmread and vmwrite are used by the hypervisor to read and change the guest and host specifications, causing L1 to exit multiple times while it handles a single L2 exit. With AMD SVM, by contrast, the guest and host specifications can be read and written directly using ordinary memory loads and stores, so L0 does not intervene while L1 modifies L2&#039;s specification.&lt;br /&gt;
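The copy-only-what-changed idea can be sketched like this (the class and field names are invented; the actual VMCS fields are defined by Intel&#039;s architecture):&lt;br /&gt;

```python
# Invented sketch of the partial-copy optimization: track which VMCS
# fields were modified and copy only those on each transition, instead
# of copying the whole structure every time.

class TrackedVmcs:
    def __init__(self, fields):
        self.fields = dict(fields)
        self.dirty = set()

    def vmwrite(self, field, value):
        self.fields[field] = value
        self.dirty.add(field)  # remember that this field changed

    def sync_to(self, target):
        # Copy only the modified fields into the target VMCS, then
        # reset the tracking. Returns how many fields were copied.
        for field in self.dirty:
            target[field] = self.fields[field]
        copied = len(self.dirty)
        self.dirty.clear()
        return copied

shadow = TrackedVmcs({"guest_rip": 0, "guest_rsp": 0, "host_rip": 0})
shadow.vmwrite("guest_rip", 0x1234)

hardware_vmcs = {}
print(shadow.sync_to(hardware_vmcs))  # 1: only the dirty field moved
```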
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
Overhead for nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb.&lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the exit-handling code is much slower when it runs on an upper-level hypervisor such as L1 than when the same code runs in L0.&lt;br /&gt;
&lt;br /&gt;
The paper outlines optimization steps taken to minimize this overhead.&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and accessing VMCS data directly under certain conditions, removing the need to trap and emulate.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit-handling code (the main cause of the slowdown is the additional exits triggered within the exit-handling code itself).&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. It also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
* Writing and organization: they provide links and resources that help explain the concepts they only briefly touch upon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Lots of exits. To be continued (anyone who is interested, feel free to take this topic).&lt;br /&gt;
&lt;br /&gt;
* Writing and organization: some concepts, such as the VMCSs, are written as if you were already familiar with how they work, or had read the appropriate references for that section of the paper&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5795</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5795"/>
		<updated>2010-11-30T19:53:33Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* General discussion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please,&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name or even email me, you can find my contact info in my profile page(just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtle Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not let it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can, since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add but don&#039;t edit or alter the already existing content. Lets try to get as many thoughts/ideas as possible and then we will edit and filter the redundancy later. And lets make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird because at last Wednesday&#039;s lab, the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to Ozzy so I&#039;ve got free time now. I am reading it again to refresh my memory of it and will put up notes on what I think we can criticize about it and such. What kind of references do you think we will need?  Similar papers etc?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca.  And if that guy is still in our group but doesn&#039;t participate, too bad for him--JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper. I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should get the background concepts done hopefully by tonight.  If anyone wants to help with the other sections that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah absolutely, I agree. But first, let&#039;s pin down the crucial points. And then we can discuss them collectively. If anyone happens to come across what he thinks is a good or bad point, then you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can maybe have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually the paper does talk about that. Look at the first two paragraphs in the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate, I think its because its not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided is excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need for and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here&#039;s my question: who&#039;s doing what in the group work, and where should I focus my attention to do my part? - Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted a few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need for and advantages of NV, the models, the trap-and-emulate model of x86 machines, computer paging, which is discussed in the paper, and computer ring security, which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far should we go here ? should we provide background on the hardware architecture used by the authors like the x86 family and the VMX chips, or maybe some of the concepts discussed later on in the testing such as optimization, emulation, para-virtualization ?&lt;br /&gt;
&lt;br /&gt;
I will speak and consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that, from what I understood, they apply the same model (the trap and emulate) but they provide optimizations and ways to increase the trap calls efficiency between the nested environments, so thats definitely a contribution, but its more of a performance optimization kind of contribution I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper; the due date is now Dec 2nd.&#039;&#039;&#039; And that&#039;s really good, given that some of those concepts require time to sort of formulate. I also asked the prof about the approach that we should follow in terms of presenting the material, and he mentioned that you need to provide enough information in each section to make your fellow students understand what the paper is about without them having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details: if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in 2 small paragraphs or something.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah I am really thankful he left us with another week to do it.  I am sure we all have at least 3 projects due soon, other than this Essay.  I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow.  I was going to do it yesterday but he gave us an extension, so I slacked off a bit.  I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the backgrounds concept here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration, as far as I know, we should be allowed to do this. If anyone have any suggestions to what I have posted or any counter arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive amount of exits can somehow be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design somehow too complex or difficult to get a hold of, I&#039;m not sure about this, but just guessing from a general programming point of view. I will email the prof today, maybe he can give us some hints for what can be considered a weakness or a bad spot if you will in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version, then move it to the actual article. As for the critique section, how about we put a section on the talk page here and people can add in what they thought worked/didn&#039;t work with some explanation/references, and then we can get someone/some people to combine it and put it in the essay? &lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Yeah really, great work on the Background. It&#039;s looking slick. I added some initial edits in the Contribution and Critique, but I agree, let&#039;s open a thread here and all collaborate. --[[User:Praubic|Praubic]] 18:24, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Nice man.  Sorry I haven&#039;t updated with anything that I have done yet, but I&#039;ll have  it up later today or tomorrow.  I got both an Essay and game dev project done for tomorrow, so after 1 I will be free to work on this until it is time for 3004--JSlonosky 13:41, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
I put up an initial version of the research problem section in the article. Let me know what you guys think. --[[User:Mbingham|Mbingham]] 19:53, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is creating an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] Usually referred to as a virtual machine, this emulation, which includes a guest hypervisor and a virtualized environment, gives the guest the illusion that it is running directly on the real hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
Also referred to as a VMM (virtual machine monitor), the hypervisor is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the possible issues that may arise from the interaction of those guest virtual machines with one another and with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the bare-metal hypervisor (L0) runs a VM called L1; in turn, L1 runs another VM L2, L2 then runs L3, and so on.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute, gain or access privileged hardware context, it triggers a trap or fault which is caught and handled by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute, and based on that, provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
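The model can be sketched as a toy Python loop (illustrative only, with made-up instruction names; this is not the paper&#039;s implementation):&lt;br /&gt;

```python
# A toy trap-and-emulate loop (illustrative; not code from the paper).
# Privileged guest actions trap to the host hypervisor, which validates
# them and emulates the requested outcome; everything else runs natively.

PRIVILEGED = {"write_cr3", "hlt", "out"}   # example privileged instructions

def run_guest_instruction(insn, emulate, log):
    if insn in PRIVILEGED:
        log.append("trap:" + insn)   # fault caught by the host hypervisor
        return emulate(insn)         # host decides and emulates the result
    return "executed:" + insn        # unprivileged code runs directly

def emulate(insn):
    return "emulated:" + insn        # stand-in for the host emulation

log = []
print(run_guest_instruction("add", emulate, log))        # executed:add
print(run_guest_instruction("write_cr3", emulate, log))  # emulated:write_cr3
```

The guest never notices the difference: both paths return a result, but only the privileged one detours through the host.&lt;br /&gt;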
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this would&lt;br /&gt;
be the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used for live migration or transfer of virtual machines in cases of upgrade or disaster &lt;br /&gt;
recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that is easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMWare and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since&lt;br /&gt;
we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they are not really altering the underlying architecture, and this is basically the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, yet they were able to do it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the quickly developing cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A cloud user manages his own virtual machine directly through a hypervisor of his choice. In addition, nesting provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper where he talks about more recent research work on virtualization, particularly in his first paragraph, he refers to some more recent research by the VMWare technical support team. He also talks about some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
Apparently, there are two models for applying nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles every other hypervisor running on top of it. For instance, suppose L0 (the host hypervisor) runs L1. If L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 will take care of the trap handling and such.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: This is the model supported by x86 machines. It is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX-capable CPU in the paper&#039;s implementation) and presents a fake platform for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think that it is running on the actual hardware. The idea is that when a guest hypervisor tries to operate on or gain hardware-level privileges, it provokes a fault or trap; this trap is caught by the main host hypervisor and inspected to see whether it is a legitimate and appropriate request. If it is, the host grants the privilege to the guest, again letting it think that it is actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, every trap must go back to the main host hypervisor. The host hypervisor then forwards the trap and virtualization specification to whichever upper level is responsible for it. For instance, suppose L0 runs L1 and L1 attempts to run L2: the command to run L2 goes down to L0, and L0 then forwards it back up to L1. This is the model we&#039;re interested in because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
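The forwarding path can be illustrated with a small Python sketch (a hypothetical dispatcher; the exit reasons are examples): the hardware delivers every exit to L0 first, and L0 reflects to L1 only the exits that belong to L1.&lt;br /&gt;

```python
# Sketch of exit handling in the single-level model (hypothetical dispatcher;
# the exit reasons are examples). Hardware delivers every L2 exit to L0 first;
# L0 then either handles it itself or reflects it up to L1.

def handle_exit(reason, l1_owned_reasons, log):
    log.append(("L0", reason))                    # every exit lands in L0
    if reason in l1_owned_reasons:
        log.append(("forwarded_to_L1", reason))   # L0 reflects the exit to L1
        return "L1"
    return "L0"                                   # L0 handles it directly

log = []
print(handle_exit("cpuid", {"cpuid", "io"}, log))  # L1
```

Even when L1 ends up handling the exit, it always pays for the round trip through L0, which is where the extra overhead of the single-level model comes from.&lt;br /&gt;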
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles project has four components that are crucial to its implementation.&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-Optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How nested VMX virtualization works:&lt;br /&gt;
L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed along to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine and only L0 occupies the architectural mode reserved for a hypervisor. To multiplex the hardware, L2 must effectively be run as a virtual machine of L0, so L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to become VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 now launches L2; when L2 causes a trap, L0 either handles it itself or forwards it to L1, depending on whether it is the L1 virtual machine&#039;s responsibility to handle.&lt;br /&gt;
To handle a single L2 exit, L1 needs to read and write the VMCS and disable interrupts, which would not normally be a problem, but because L1 is running in guest mode as a virtual machine, all of these operations trap, so a single high-level L2 exit causes many low-level exits (and more exits mean less performance). This problem was addressed by making the single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0 (depending on the trap) finishes handling it and resumes L2, and this process repeats continuously.&lt;br /&gt;
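The VMCS merge can be sketched as follows (an illustrative Python model; the field names and the split into guest state, host state and controls are simplified assumptions, not the real VMCS layout):&lt;br /&gt;

```python
# Toy model of L0 merging VMCS(0, 1) and VMCS(1, 2) into VMCS(0, 2).
# Guest state comes from what L1 prepared for L2; host state must remain
# L0 state, since every exit from L2 lands in L0 first; control bits
# requested by both levels are combined.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    return {
        "guest": dict(vmcs_1_2["guest"]),    # L2 state as specified by L1
        "host": dict(vmcs_0_1["host"]),      # exits must return to L0
        "controls": vmcs_0_1["controls"] | vmcs_1_2["controls"],
    }

vmcs_0_1 = {"guest": {"rip": 100}, "host": {"rip": 7}, "controls": {"ept"}}
vmcs_1_2 = {"guest": {"rip": 200}, "host": {"rip": 100},
            "controls": {"intr_exit"}}
vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest"]["rip"], sorted(vmcs_0_2["controls"]))
# 200 ['ept', 'intr_exit']
```

The key invariant is that the hardware only ever runs VMCSs loaded by L0, so the merged structure keeps L0 as the exit destination while carrying the guest state L1 asked for.&lt;br /&gt;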
&lt;br /&gt;
How multi-dimensional paging works:&lt;br /&gt;
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
Overhead for nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb.&lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the exit-handling code is much slower when it runs on an upper-level hypervisor such as L1 than when the same code runs in L0.&lt;br /&gt;
&lt;br /&gt;
The paper outlines optimization steps taken to minimize this overhead.&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and accessing VMCS data directly under certain conditions, removing the need to trap and emulate.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit-handling code (the main cause of the slowdown is the additional exits triggered within the exit-handling code itself).&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The authors also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* lots of exits. To be continued. (Anyone who is interested, feel free to take this topic.)&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=5794</id>
		<title>COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_2_2010_Question_9&amp;diff=5794"/>
		<updated>2010-11-30T19:51:51Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: Rough draft of research problem&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;&#039;&#039;Go to discussion for group members confirmation, general talk and paper discussions.&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;&amp;lt;big&amp;gt;&amp;lt;big&amp;gt;&#039;&#039;&#039;&amp;quot;The Turtles Project: Design and Implementation of Nested Virtualization&amp;quot;&#039;&#039;&#039;&amp;lt;/big&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Authors:&#039;&#039;&#039;&lt;br /&gt;
* Muli Ben-Yehuda +
* Michael D. Day ++      &lt;br /&gt;
* Zvi Dubitzky +       &lt;br /&gt;
* Michael Factor +       &lt;br /&gt;
* Nadav Har’El +       &lt;br /&gt;
* Abel Gordon +&lt;br /&gt;
* Anthony Liguori ++&lt;br /&gt;
* Orit Wasserman +&lt;br /&gt;
* Ben-Ami Yassour +&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Research labs:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
+ IBM Research – Haifa&lt;br /&gt;
&lt;br /&gt;
++ IBM Linux Technology Center&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Website:&#039;&#039;&#039; http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Video presentation:&#039;&#039;&#039; http://www.usenix.org/multimedia/osdi10ben-yehuda [Note: username and password are required for entry]&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
=Background Concepts=&lt;br /&gt;
&lt;br /&gt;
Before we delve into the details of our research paper, it&#039;s essential that we provide some insight and background on the concepts and notions discussed by the authors.&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] This emulation, usually referred to as a virtual machine, includes a guest hypervisor and a virtualized environment, and gives the guest the illusion that it is running directly on the real hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within the context of operating systems.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
A hypervisor, also referred to as a VMM (virtual machine monitor), is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guest virtual machines with one another and with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the main operating system (L1) runs a VM called L2, in turn, L2 runs another VM L3, L3 then runs L4 and so on.&lt;br /&gt;
&lt;br /&gt;
[[File:VirtualizationDiagram-MH.png]]&lt;br /&gt;
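The nesting chain can be modeled as a simple linked structure (an illustrative sketch only; the class and its names are invented here, not taken from any real hypervisor):&lt;br /&gt;

```python
# Illustrative sketch: a chain of nested levels, each level running the next.
class Level:
    def __init__(self, name, guest=None):
        self.name = name      # e.g. "L1"
        self.guest = guest    # the virtual machine this level runs, if any

    def depth(self):
        """Number of virtualization levels nested below this machine."""
        return 0 if self.guest is None else 1 + self.guest.depth()

# Following the text's numbering: L1 runs L2, which runs L3, which runs L4.
l4 = Level("L4")
l3 = Level("L3", guest=l4)
l2 = Level("L2", guest=l3)
l1 = Level("L1", guest=l2)
print(l1.depth())  # 3 levels of nesting below L1
```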
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute, gain or access privileged hardware context, it triggers a trap or a fault which gets caught and handled by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute or not, and based on that, provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
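A minimal sketch of this cycle (a toy model with invented names; real handlers live inside the host hypervisor, not in Python):&lt;br /&gt;

```python
# Toy trap-and-emulate loop: the guest's privileged instructions trap to
# the host hypervisor, which emulates their effect and resumes the guest.
PRIVILEGED = {"vmlaunch", "vmread", "vmwrite", "cli"}

def emulate(insn):
    # The host hypervisor produces the outcome the guest expected.
    return f"emulated {insn}"

def run_guest(instructions):
    log = []
    for insn in instructions:
        if insn in PRIVILEGED:
            # Trap: control transfers to the host hypervisor, which
            # decides whether to emulate the effect or deny it.
            log.append(("trap", insn, emulate(insn)))
        else:
            # Unprivileged instructions run directly on the hardware.
            log.append(("direct", insn, None))
    return log

trace = run_guest(["add", "vmread", "mov"])
print([t[0] for t in trace])  # ['direct', 'trap', 'direct']
```

Only the privileged instruction takes the slow trap path; everything else runs at native speed, which is what makes trap-and-emulate viable.&lt;br /&gt;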
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this would be the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in the live migration or transfer of virtual machines in cases of upgrade or disaster recovery. Consider a scenario where a number of virtual machines must be moved to a new hardware server for an upgrade: instead of having to move each VM separately, we can nest those virtual machines and their hypervisors to create one nested entity that&#039;s easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated or even restored, since we can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
=Research problem=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Rough version. Let me know of any comments/improvements that can be made on the talk page&#039;&#039;&#039;--[[User:Mbingham|Mbingham]] 19:51, 30 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Nested virtualization has been studied since the mid-1970s (see paper citations 21, 22 and 36). Early research in the area assumes that there is hardware support for nested virtualization. Actual implementations of nested virtualization, such as the z/VM hypervisor in the early 1990s, also required architectural support. Other solutions assume the hypervisors and operating systems being virtualized have been modified to be compatible with nested virtualization. There have also recently been software-based solutions (see citation 12); however, these solutions suffer from significant performance problems.&lt;br /&gt;
&lt;br /&gt;
The main barrier to nested virtualization without architectural support is that, as you increase the levels of virtualization, the number of control switches between different levels of hypervisors increases. A trap in a deeply nested virtual machine first goes to the bottom-level hypervisor, which can send it up to the second-level hypervisor, which can in turn send it up (or back down), until in the worst case it reaches the hypervisor one level below the virtual machine itself. The trap can be bounced between different levels of hypervisor, so one trap instruction multiplies into many trap instructions.&lt;br /&gt;
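This trap multiplication can be made concrete with a toy recurrence (the cost figure m is assumed purely for illustration): if handling one trap at level k itself requires m privileged operations, and each of those traps to the level below, the total number of traps grows multiplicatively with nesting depth.&lt;br /&gt;

```python
# Toy model: traps multiply with nesting depth.  Handling one trap at a
# given level costs m privileged operations, each trapping one level down.
def total_traps(depth, m):
    """Total traps triggered by a single trap in a guest nested `depth`
    levels deep, if each level's handler performs m trapping operations."""
    if depth == 0:
        return 1  # L0 handles it directly on the hardware
    return 1 + m * total_traps(depth - 1, m)

for d in range(4):
    print(d, total_traps(d, m=3))
# depth 0 -> 1, depth 1 -> 4, depth 2 -> 13, depth 3 -> 40
```

Even with a modest per-level cost, the trap count grows roughly as m to the power of the depth, which is why naive software-only nesting performs so poorly.&lt;br /&gt;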
&lt;br /&gt;
Generally, solutions that require architectural support and specialized software for the guest machines are not practically useful, because this support does not always exist, such as on x86 processors. Solutions that do not require this suffer from significant performance costs because of how the number of traps expands as nesting depth increases. This paper presents a technique that reconciles the lack of hardware support on available hardware with efficiency: it solves the problem of a single nested trap expanding into many more trap instructions, which allows efficient virtualization without architectural support.&lt;br /&gt;
&lt;br /&gt;
=Contribution=&lt;br /&gt;
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The non-stop evolution of computing invites intricate designs that are virtualized and harmonious with cloud computing. The paper contributes to this trend by allowing consumers and users to run machines with &#039;&#039;&#039;their&#039;&#039;&#039; choice of hypervisor/OS combination, which provides grounds for security and compatibility. The sophisticated abstractions presented in the paper, such as shadow paging and the isolation of a single OS&#039;s resources, enable programmers to build further developments and ideas on this infrastructure. For example, the paper Accountable Virtual Machines wraps programs around a particular VM state, which could certainly be placed on a separate hypervisor for ideal isolation.&lt;br /&gt;
&lt;br /&gt;
=Critique=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
.. to be continued ..&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== The good ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== The bad ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== The style of paper ===&lt;br /&gt;
&lt;br /&gt;
The paper presents an elaborate description of the concept of nested virtualization in a very specific manner, and it does a good job of conveying the technical details. Depending on the reader&#039;s background knowledge it can appear very complex; personally, it required quite some research before I could fully delve into the theory of the design. For instance, section 4.1.2, &amp;quot;Impact of Multi-dimensional Paging&amp;quot;, illustrates the technique with an example using terms such as EPT and L1. All in all, the provided video greatly increased my awareness of the subject of nested hypervisors.&lt;br /&gt;
&lt;br /&gt;
=== Conclusion ===&lt;br /&gt;
&lt;br /&gt;
Bottom line, the research presented in the paper is the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The authors also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
[1] Tanenbaum, Andrew (2007).&#039;&#039; Modern Operating Systems (3rd edition)&#039;&#039;, page 569.&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5695</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5695"/>
		<updated>2010-11-29T18:13:05Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: general discussion&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem: Michael Bingham&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see whos still on board for the course. So please&lt;br /&gt;
if you read this, add your name to the list of members above. You can my find my contact info in my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us so he still in for the course, that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name or even email me, you can find my contact info in my profile page(just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtle Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add but don&#039;t edit or alter the already existing content. Lets try to get as many thoughts/ideas as possible and then we will edit and filter the redundancy later. And lets make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. Its weird because on last Wednesday&#039;s lab, the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to ozzy so I got free time now. I am reading it again to refresh my memory of it and will put notes of what I think we can criticize about it and such. What kind of references do you think we will need?  Similar papers etc?&lt;br /&gt;
If you need to a hold of me. Best way is through email. jslonosk@connect.Carleton.ca.  And if that is still in our group but doesn&#039;t participate,  too bad for him--JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need to as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper. I will definitely be adding those paper by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of whos doing what. I should get the background concept done hopefully by tonight.  If anyone want to help with the other sections that would be great, please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah absloutely, I agree. But first, lets pin down the crucial points. And then we can discuss them collectively. If anyone happens to come across what he thinks is good or bad, then you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought maybe if we each member focuses on a specific part in the beginning, we can maybe have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary? I posted a link in references and I&#039;l try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually the paper does talk about that. Look at the first two paragraphs in the introduction section of the paper on page 1. But you&#039;re right, they don&#039;t really elaborate, I think its because its not the purpose or the aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
--------------&lt;br /&gt;
The stuff that Michael provided are excellent. That was actually what I was planning on doing. I will start by defining virtualization, hypervisors, computer ring security, the need and uses of nested virtualization, the models, etc. --[[User:Hesperus|Hesperus]] 22:14, 22 November 2010 (UTC)&lt;br /&gt;
-------------&lt;br /&gt;
So here my question  who doing what in the group work and where should I focus my attention to do my part?- Csulliva&lt;br /&gt;
-------------&lt;br /&gt;
I have posted few things regarding the background concepts on the main page. I will go back and edit it today and talk about other things like: nested virtualization, the need and advantages of NV, the models, the trap and emulate model of x86 machines, computer paging which is discussed in the paper, computer ring security which again they touch on at some point in the paper. I can easily move some of the things I wrote in the theory section to the main page, but I want to consult the prof first on some of those things.&lt;br /&gt;
&lt;br /&gt;
One thing that I&#039;m still unsure of is how far should we go here ? should we provide background on the hardware architecture used by the authors like the x86 family and the VMX chips, or maybe some of the concepts discussed later on in the testing such as optimization, emulation, para-virtualization ?&lt;br /&gt;
&lt;br /&gt;
I will speak and consult the prof today after our lecture. If other members want to help, you guys can start with the related work and see how the content of the paper compares to previous or even current research papers. --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
------------------------&lt;br /&gt;
In response to what Michael mentioned above in the background section: we should definitely talk about that, from what I understood, they apply the same model (the trap and emulate) but they provide optimizations and ways to increase the trap calls efficiency between the nested environments, so thats definitely a contribution, but its more of a performance optimization kind of contribution I guess, which is why I mentioned the optimizations in the contribution section below.  --[[User:Hesperus|Hesperus]] 08:08, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
&#039;&#039;&#039;Ok, so for those who didn&#039;t attend today&#039;s lecture, the prof was nice enough to give us an extension for the paper, the due date now is Dec 2nd.&#039;&#039;&#039; And thats really good, given that some of those concepts require time to sort of formulate. I also asked the prof on the approach that we should follow in terms of presenting the material, and he mentioned that you need to provide enough information for each section to make your follow student understand what the paper is about without them having to actually read the paper or go through it in detail. He also mentioned the need to distill some of the details, if the paper spends a whole page explaining multi-dimensional paging, we should probably explain that in 2 small paragraphs or something.&lt;br /&gt;
&lt;br /&gt;
Also, we should always cite resources. If the resource is a book, we should cite the page number as well. --[[User:Hesperus|Hesperus]] 15:16, 23 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Yeah I am really thankful he left us with another week to do it.  I am sure we all have at least 3 projects due soon, other than this Essay.  I&#039;ll type up the stuff that I had highlighted for Tuesday as a break tomorrow.  I was going to do it yesterday but he gave us an extension, so I slacked off a bit.  I also forgot :/ --JSlonosky 23:43, 24 November 2010 (UTC)&lt;br /&gt;
---------------------------&lt;br /&gt;
Hey dudes. I have posted the first part of the backgrounds concept here in the discussion and on the main page as well. This is just a rough version, so I will be constantly expanding it and adding resources later on today. I have also created and added a diagram for illustration, as far as I know, we should be allowed to do this. If anyone have any suggestions to what I have posted or any counter arguments, please discuss. I will also be moving some of the stuff I wrote here (the theory section) to the main page as well.&lt;br /&gt;
&lt;br /&gt;
Regarding the critique, I guess the excessive amount of exits can somehow be seen as a &#039;&#039;&#039;scalability&#039;&#039;&#039; constraint, maybe making the overall design somehow too complex or difficult to get a hold of, I&#039;m not sure about this, but just guessing from a general programming point of view. I will email the prof today, maybe he can give us some hints for what can be considered a weakness or a bad spot if you will in the paper. &lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing the sixth member of the group: Shawn Hansen. --[[User:Hesperus|Hesperus]] 06:57, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
Hey guys. I can start working on the research problem part of the essay. I&#039;ll put it up here when I have a rough version than move it to the actual article. As for the critique section, how about we put a section on the talk page here and people can add in what they thought worked/didn&#039;t work with some explanation/references, and then we can get someone/some people to combine it and put it in the essay? &lt;br /&gt;
--[[User:Mbingham|Mbingham]] 18:13, 29 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Virtualization===&lt;br /&gt;
&lt;br /&gt;
In essence, virtualization is the creation of an emulation of the underlying hardware for a guest operating system, program or process to operate on. [1] This emulation, usually referred to as a virtual machine, includes a guest hypervisor and a virtualized environment, and gives the guest the illusion that it is running directly on the real hardware. In other words, we can view this virtual machine as an application running on the host OS.&lt;br /&gt;
 &lt;br /&gt;
The term virtualization has become rather broad, associated with a number of areas where this technology is used like data virtualization, storage virtualization, mobile virtualization and network virtualization. For the purposes and context of our assigned paper, we shall focus our attention on hardware virtualization within operating systems environments.&lt;br /&gt;
&lt;br /&gt;
====Hypervisor==== &lt;br /&gt;
A hypervisor, also referred to as a VMM (virtual machine monitor), is a software module that exists one level above the supervisor and runs directly on the bare hardware to monitor the execution and behaviour of the guest virtual machines. The main task of the hypervisor is to provide an emulation of the underlying hardware (CPU, memory, I/O, drivers, etc.) to the guest virtual machines and to take care of the issues that may arise from the interaction of those guest virtual machines with one another and with the host hardware and operating system. It also controls host resources.&lt;br /&gt;
&lt;br /&gt;
====Nested virtualization====&lt;br /&gt;
Nested virtualization is the concept of recursively running one or more virtual machines inside one another. For instance, the main operating system (L1) runs a VM called L2, in turn, L2 runs another VM L3, L3 then runs L4 and so on.&lt;br /&gt;
&lt;br /&gt;
====Para-virtualization====&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Trap and emulate model===&lt;br /&gt;
A virtualization model based on the idea that when a guest hypervisor attempts to execute, gain or access privileged hardware context, it triggers a trap or a fault which gets caught and handled by the host hypervisor. The host hypervisor then determines whether this instruction should be allowed to execute or not, and based on that, provides an emulation of the requested outcome to the guest hypervisor. The x86 systems discussed in the Turtles Project research paper follow this model.&lt;br /&gt;
&lt;br /&gt;
===The uses of nested virtualization===&lt;br /&gt;
&lt;br /&gt;
====Compatibility====&lt;br /&gt;
A system could provide the user with a compatibility mode for other operating systems or applications. An example of this would be the Windows XP mode that&#039;s available in Windows 7, where Windows 7 runs Windows XP as a virtual machine.&lt;br /&gt;
&lt;br /&gt;
====Cloud computing====&lt;br /&gt;
A cloud provider, more formally referred to as an Infrastructure-as-a-Service (IaaS) provider, could use nested virtualization to give customers the ability to host their own preferred user-controlled hypervisors and run their virtual machines on the provider&#039;s hardware. This way both sides benefit: the provider can attract customers, and the customer has the freedom to implement its system on the host hardware without worrying about compatibility issues.&lt;br /&gt;
&lt;br /&gt;
The most well-known example of an IaaS provider is Amazon Web Services (AWS). AWS presents a virtualized platform on which other services and web sites, such as Netflix, host their APIs and databases on Amazon&#039;s hardware.&lt;br /&gt;
&lt;br /&gt;
====Security==== &lt;br /&gt;
[Coming...]&lt;br /&gt;
&lt;br /&gt;
====Migration/Transfer of VMs====&lt;br /&gt;
Nested virtualization can also be used in live migration or transfer of virtual machines in cases of upgrade or disaster &lt;br /&gt;
recovery. Consider a scenarion where a number of virtual machines must be moved to a new hardware server for upgrade, instead of having to move each VM sepertaely, we can nest those virtual machines and their hypervisors to create one nested entity thats easier to deal with and more manageable.&lt;br /&gt;
In the last couple of years, virtualization packages such as VMware and VirtualBox have adopted this notion of live migration and developed their own embedded migration/transfer agents.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
Using virtual machines is convenient for testing, evaluation and benchmarking purposes. Since a virtual machine is essentially&lt;br /&gt;
a file on the host operating system, if it is corrupted or damaged it can easily be removed, recreated, or even restored, since we&lt;br /&gt;
can create a snapshot of the running virtual machine.&lt;br /&gt;
&lt;br /&gt;
===Protection rings===&lt;br /&gt;
[Coming....]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;br /&gt;
==Research problem==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they&#039;re not really altering the underlying architecture. This is basically the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, but apparently they were able to do it.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;How does the concept apply to the rapidly developing field of cloud computing?&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
A cloud user manages his own virtual machine directly through a hypervisor of choice. In addition, it provides increased security through hypervisor-level intrusion detection.&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
Comparisons with other related/similar research and work:&lt;br /&gt;
&lt;br /&gt;
Refer to the following website and to the related work section in the paper regarding this section: &lt;br /&gt;
http://www.spinics.net/lists/kvm/msg43940.html&lt;br /&gt;
&lt;br /&gt;
[This is a forum post by one of the authors of our assigned paper where he talks about more recent research work on virtualization, particularly in his first paragraph, he refers to some more recent research by the VMWare technical support team. He also talks about some of the research papers referred to in our assigned paper.] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
Apparently, there are two models for applying nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the hypervisors running on top of it. For instance, if L0 (the host hypervisor) runs L1 and L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 handles the trapping and related work.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: this is the model supported by x86 machines. It is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX extensions in the paper&#039;s implementation) and presents a faked environment for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think it is running on the actual hardware. The idea is that when a guest hypervisor attempts to operate with hardware-level privileges, it provokes a fault or trap; this trap is caught by the host hypervisor and inspected to see whether it is a legitimate or appropriate request. If it is, the host emulates the privileged operation for the guest, again having the guest think it is actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, every trap must go back to the host hypervisor, L0. L0 then forwards the trap and virtualization state to the level responsible for handling it. For instance, if L0 runs L1, and L1 attempts to run L2, then the request to run L2 traps down to L0, and L0 forwards it back up to L1. This is the model we&#039;re interested in because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
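The forwarding rule just described (every trap lands at L0 first, and L0 hands it to whichever hypervisor is responsible for the trapping guest) can be sketched as follows. The L0/L1/L2 names follow the paper&#039;s terminology, but the helper itself is purely illustrative.&lt;br /&gt;

```python
# Single-level architectural support: the hardware always delivers traps
# to L0. L0 then forwards each trap to the hypervisor directly beneath
# the trapping guest (e.g. an L2 exit is L1's responsibility).
# Purely illustrative; not code from the paper.

parent = {"L1": "L0", "L2": "L1", "L3": "L2"}  # who runs whom

def deliver_trap(trapping_level):
    """Return the path a trap travels and who ultimately handles it:
    first down to L0, then forwarded to the responsible parent."""
    handler = parent[trapping_level]
    path = [trapping_level, "L0"]   # hardware delivers every trap to L0
    if handler != "L0":
        path.append(handler)        # L0 forwards it to the responsible level
    return path, handler

print(deliver_trap("L2"))  # L2's trap goes to L0, then to L1
print(deliver_trap("L1"))  # L1's trap is handled by L0 itself
```

This round trip through L0 on every trap is exactly why exits become the dominant cost as nesting gets deeper.&lt;br /&gt;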
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
The Turtles project has four components that are crucial to its implementation.&lt;br /&gt;
* Nested VMX virtualization for nested CPU virtualization&lt;br /&gt;
* Multi-dimensional paging for nested MMU virtualization&lt;br /&gt;
* Multi-level device assignment for nested I/O virtualization&lt;br /&gt;
* Micro-optimizations to make it go faster&lt;br /&gt;
&lt;br /&gt;
How nested VMX virtualization works:&lt;br /&gt;
L0 (the lowest-level hypervisor) runs L1 with VMCS0-&amp;gt;1 (a virtual machine control structure). The VMCS is the fundamental data structure a hypervisor prepares to describe a virtual machine; it is passed along to the CPU to be executed. L1 (also a hypervisor) prepares VMCS1-&amp;gt;2 to run its own virtual machine and executes vmlaunch. vmlaunch traps, and L0 has to handle the trap, because L1 is itself running as a virtual machine (L0 is the one occupying the architecture&#039;s single hypervisor mode). To multiplex the hardware and make L2 run as a virtual machine of L1, L0 merges the VMCSs: VMCS0-&amp;gt;1 is merged with VMCS1-&amp;gt;2 to produce VMCS0-&amp;gt;2, enabling L0 to run L2 directly. L0 then launches L2; whenever L2 traps, L0 either handles the trap itself or forwards it to L1, depending on whether it is L1&#039;s responsibility to handle.&lt;br /&gt;
Handling even a single L2 exit is expensive: L1 needs to read and write the VMCS and disable interrupts, which wouldn&#039;t normally be a problem, but because L1 is running in guest mode as a virtual machine, each of those operations traps as well, so a single high-level L2 exit causes many additional exits (and more exits mean less performance). This problem was addressed by making a single exit fast and by reducing the frequency of exits with multi-dimensional paging. In the end, L1 or L0, depending on the trap, finishes handling it and resumes L2; this process repeats continuously.&lt;br /&gt;
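The VMCS merge described above can be sketched roughly like this. The field names are invented for illustration (a real VMCS has far more fields), but the shape of the merge follows the description: guest state comes from VMCS1-&amp;gt;2, while exit (host) state must return control to L0.&lt;br /&gt;

```python
# Rough sketch of the VMCS merge (field names are made up for
# illustration): L0 combines VMCS0->1 and VMCS1->2 into VMCS0->2 so
# that the hardware can run L2 directly on L1's behalf.

def merge_vmcs(vmcs_0_1, vmcs_1_2):
    """Produce VMCS0->2: L2's guest state is what L1 specified for it,
    but on exit control must return to L0, not to L1."""
    return {
        # L2's guest state, as prepared by L1:
        "guest_state": vmcs_1_2["guest_state"],
        # On exit, the CPU must switch back to L0:
        "host_state": vmcs_0_1["host_state"],
        # Trap on everything either level wants trapped:
        "exit_controls": vmcs_0_1["exit_controls"] | vmcs_1_2["exit_controls"],
    }

vmcs_0_1 = {"guest_state": "L1-regs", "host_state": "L0-entry",
            "exit_controls": {"io", "cr3"}}
vmcs_1_2 = {"guest_state": "L2-regs", "host_state": "L1-entry",
            "exit_controls": {"io", "interrupts"}}

vmcs_0_2 = merge_vmcs(vmcs_0_1, vmcs_1_2)
print(vmcs_0_2["guest_state"])  # L2-regs
print(vmcs_0_2["host_state"])   # L0-entry
```

Note the asymmetry: the guest side flows up from L1&#039;s specification, the host side flows down from L0, and the exit controls are the union, since a trap wanted by either hypervisor must actually occur.&lt;br /&gt;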
&lt;br /&gt;
How multi-dimensional paging works:&lt;br /&gt;
&lt;br /&gt;
==Performance==&lt;br /&gt;
Two benchmarks were used: kernbench, which compiles the Linux kernel multiple times, and SPECjbb, which is designed to measure server-side performance of Java run-time environments.&lt;br /&gt;
&lt;br /&gt;
Overhead for nested virtualization is 10.3% with kernbench and 6.3% with SPECjbb.&lt;br /&gt;
There are two sources of overhead evident in nested virtualization. First, the transitions between L1 and L2 are slower than the transitions at the lower level of the nested design (between L0 and L1). Second, the code handling exits is much slower when it runs in a guest hypervisor such as L1 than when the same code runs in L0.&lt;br /&gt;
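A toy cost model makes the second source of overhead concrete. The operation count below is hypothetical, purely to illustrate how one logical L2 exit multiplies into many world switches when L1&#039;s handler itself runs in guest mode.&lt;br /&gt;

```python
# Hypothetical cost model for the exit-multiplication problem: handling
# one L2 exit requires L1 to issue several privileged operations
# (vmread/vmwrite, interrupt masking), and in guest mode each of those
# traps to L0 as well. The counts are illustrative, not measurements.

def exits_caused_by_one_l2_exit(privileged_ops_in_handler):
    # 1 for the original L2 exit, plus one L0 exit per privileged
    # operation that L1's handler executes while in guest mode.
    return 1 + privileged_ops_in_handler

# If L1's exit handler performs, say, 10 vmreads/vmwrites, one logical
# L2 exit turns into 11 world switches:
print(exits_caused_by_one_l2_exit(10))  # 11
```

This is why the optimizations below target both the cost of a single exit and the number of exits.&lt;br /&gt;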
&lt;br /&gt;
The paper outlines the optimization steps taken to minimize this overhead.&lt;br /&gt;
&lt;br /&gt;
1. Bypassing the vmread and vmwrite instructions and directly accessing the data under certain conditions, removing the need to trap and emulate them.&lt;br /&gt;
&lt;br /&gt;
2. Optimizing the exit-handling code (the main cause of the slowdown is the additional exits provoked by the exit-handling code itself).&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. It also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
* security - being able to run other hypervisors without being detected&lt;br /&gt;
&lt;br /&gt;
* testing, debugging - of hypervisors&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Lots of exits. To be continued. (Anyone who is interested, feel free to take this topic.)&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5386</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5386"/>
		<updated>2010-11-22T21:40:44Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Background Concepts and Other Stuff */ small edit&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem:&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name, or even email me; you can find my contact info on my profile page (just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtles Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add it, but don&#039;t edit or alter the already existing content. Let&#039;s try to get as many thoughts/ideas as possible, and then we will edit and filter out the redundancy later. And let&#039;s make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird, because at last Wednesday&#039;s lab the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man. We really do need to get on this. Not going to ozzy so I got free time now. I am reading it again to refresh my memory of it and will put notes of what I think we can criticize about it and such. What kind of references do you think we will need?  Similar papers etc?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca. And if that guy is still in our group but doesn&#039;t participate, too bad for him --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The related work section has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper; I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should get the background concepts done, hopefully by tonight. If anyone wants to help with the other sections, that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned the most important part of the paper is the Critique so we gotta focus on that altogether not just one person for sure.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah, absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across what he thinks is a good or bad point, then you can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can maybe have a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in references and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually, the paper does talk about that; look at the first two paragraphs of the introduction section on page 1. But you&#039;re right, they don&#039;t really elaborate. I think it&#039;s because it&#039;s not the purpose or aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
&lt;br /&gt;
==The idea/goal==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they&#039;re not really altering the underlying architecture. This is basically the most interesting thing about the paper: x86 computers don&#039;t support nested virtualization in hardware, but apparently they were able to do it.&lt;br /&gt;
In addition, nested virtualization is generally not supported on x86 systems (the architecture is not designed with that in mind), but, for example, Windows 7 runs an XP VM under the covers when running XP programs, which shows the ability for parallel virtualization on a single hypervisor.&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. Example: Virtualization on servers has been rapidly gaining popularity. The next evolution step is to extend a single level of memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
Apparently, there are two models for applying nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the hypervisors running on top of it. For instance, if L0 (the host hypervisor) runs L1 and L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 handles the trapping and related work.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: this is the model supported by x86 machines. It is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX extensions in the paper&#039;s implementation) and presents a faked environment for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think it is running on the actual hardware. The idea is that when a guest hypervisor attempts to operate with hardware-level privileges, it provokes a fault or trap; this trap is caught by the host hypervisor and inspected to see whether it is a legitimate or appropriate request. If it is, the host emulates the privileged operation for the guest, again having the guest think it is actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, every trap must go back to the host hypervisor, L0. L0 then forwards the trap and virtualization state to the level responsible for handling it. For instance, if L0 runs L1, and L1 attempts to run L2, then the request to run L2 traps down to L0, and L0 forwards it back up to L1. This is the model we&#039;re interested in because it is what x86 machines follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Performance==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. It also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
&lt;br /&gt;
EDIT: Just noticed that someone has put their name down to do the background concept stuff, so Munther feel free to use this as a starting point if you like.&lt;br /&gt;
&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization. Why we use nested virtualization (paper gives example of XP inside win 7). Maybe going over the trap and emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization. The difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc&lt;br /&gt;
* x86 nested virtualization limitations. Single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary, under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as a main contribution, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture. Unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5383</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=5383"/>
		<updated>2010-11-22T21:16:25Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Paper summary */ Added some stuff on background concepts&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Group members=&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
* Chris Sullivan&lt;br /&gt;
* Pawel Raubic&lt;br /&gt;
&lt;br /&gt;
=Group work=&lt;br /&gt;
* Background concepts: Munther Hussain&lt;br /&gt;
* Research problem:&lt;br /&gt;
* Contribution:&lt;br /&gt;
* Critique:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=General discussion=&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your name. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Smcilroy|Smcilroy]] 17:03, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
To the person above me (Smcilroy): I can see that you&#039;re assigned to group 7 and not this one. So did the prof move you to this group or something ? We haven&#039;t confirmed or emailed the prof yet, I will wait until 1:00 pm. --[[User:Hesperus|Hesperus]] 17:22, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
Alright, so I just emailed the prof the list of members that have checked in so far (the names listed above plus Pawel Raubic),&lt;br /&gt;
Smcilroy: I still don&#039;t know whether you&#039;re in this group or not, though I don&#039;t see your name listed in the group assignments on the course webpage. To the other members: if you&#039;re still interested in doing the course, please drop in here and add your name, or even email me; you can find my contact info on my profile page (just click my signature).&lt;br /&gt;
&lt;br /&gt;
Personally speaking, I find the topic of this article (The Turtles Project) to be quite interesting and approachable, in fact we&#039;ve&lt;br /&gt;
already been playing with VirtualBox and VMWare and such things, so we should be familiar with some of the concepts the article&lt;br /&gt;
approaches like nested-virtualization, hypervisors, supervisors, etc, things that we even covered in class and we can in fact test on our machines. I&#039;ve already started reading the article, hopefully tonight we&#039;ll start posting some basic ideas or concepts and talk about the article in general. I will be in tomorrow&#039;s tutorial session in the 4th floor in case some of you guys want to get to know one another. --[[User:Hesperus|Hesperus]] 18:43, 15 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah, it looks  pretty good to me.  Unfortunately, I am attending Ozzy Osbourne on the 25th, so I&#039;d like it if we could get ourselves organized early so I can get my part done and not letting it fall on you guys. Not that I would let that happen --JSlonosky 02:51, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Why waste your money on that old man ? I&#039;d love to see Halford though, I&#039;m sure he&#039;ll do some classic Priest material, haven&#039;t checked the new record yet, but the cover looks awful, definitely the worst and most ridiculous cover of the year. Anyways, enough music talk. I think we should get it done at least on 24th, we should leave the last day to do the editing and stuff. I removed Smcilroy from the members list, I think he checked in here by mistake because I can see him in group 7. So far, we&#039;re 5, still missing one member. --[[User:Hesperus|Hesperus]] 05:36, 16 November 2010 (UTC)&lt;br /&gt;
-----&lt;br /&gt;
Yeah that would be pretty sweet.  I figured I might as well see him when I can; Since he is going to be dead soon.  How is he not already?  Alright well, the other member should show up soon, or I&#039;d guess that we are a group of 5. --JSlonosky 16:37, 16 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-----&lt;br /&gt;
Hey dudes. I think we need to get going here.. the paper is due in 4 days. I just did the paper intro section (provided the title, authors, research labs, links, etc.). I have read the paper twice so far and will be spending the whole day working on the background concepts and the research problem sections. &lt;br /&gt;
&lt;br /&gt;
I&#039;m still not sure on how we should divide the work and sections among the members, especially regarding the research contribution and critique, I mean those sections should not be based or written from the perspective of one person, we all need to work and discuss those paper concepts together.&lt;br /&gt;
&lt;br /&gt;
If anyone wants to add something, then please add it, but don&#039;t edit or alter the already existing content. Let&#039;s try to get as many thoughts/ideas as possible, and then we will edit and filter out the redundancy later. And let&#039;s make sure that we add summary comments to our edits to make it easier to keep track of everything.&lt;br /&gt;
&lt;br /&gt;
Also, we&#039;re still missing one member: Shawn Hansen. It&#039;s weird, because at last Wednesday&#039;s lab the prof told me that he attended the lab and signed his name, so he should still be in the course. --[[User:Hesperus|Hesperus]] 18:07, 21 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------&lt;br /&gt;
Yeah man, we really do need to get on this. I&#039;m not going to Ozzy, so I have free time now. I am reading the paper again to refresh my memory of it and will put up notes on what I think we can criticize about it and such. What kind of references do you think we will need? Similar papers, etc.?&lt;br /&gt;
If you need to get a hold of me, the best way is through email: jslonosk@connect.Carleton.ca.  And if he is still in our group but doesn&#039;t participate, too bad for him. --JSlonosky 14:42, 22 November 2010 (UTC)&lt;br /&gt;
----------&lt;br /&gt;
The section on the related work has all the things we need as far as other papers go. Also, I was able to find other research papers that are not mentioned in the paper; I will definitely be adding those papers by tonight. For the time being, I will handle the background concepts. I added a group work section below to keep track of who&#039;s doing what. I should hopefully get the background concepts done by tonight.  If anyone wants to help with the other sections, that would be great; please add your name to the section you want to handle below.&lt;br /&gt;
&lt;br /&gt;
I added a general paper summary below just to illustrate the general idea behind each section. If anybody wants to add anything, feel free to do so. --[[User:Hesperus|Hesperus]] 18:55, 22 November 2010 (UTC)&lt;br /&gt;
-----------&lt;br /&gt;
I remember the prof mentioned that the most important part of the paper is the Critique, so we definitely have to focus on that all together, not just one person.--[[User:Praubic|Praubic]] 19:22, 22 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
-------------&lt;br /&gt;
Yeah, absolutely, I agree. But first, let&#039;s pin down the crucial points, and then we can discuss them collectively. If anyone happens to come across something he thinks is good or bad, he can add it below to the good/bad points. Maybe the group work idea is bad, but I just thought that if each member focuses on a specific part in the beginning, we can get a better overall idea of what the paper is about. --[[User:Hesperus|Hesperus]] 19:42, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Ok, another thing I figured out is that the paper doesn&#039;t directly hint at why nested virtualization is necessary. I posted a link in references, and I&#039;ll try to research more into the purpose of nested virtualization.--[[User:Praubic|Praubic]] 19:45, 22 November 2010 (UTC)&lt;br /&gt;
--------------&lt;br /&gt;
Actually, the paper does talk about that. Look at the first two paragraphs of the introduction section on page 1. But you&#039;re right, they don&#039;t really elaborate; I think it&#039;s because that&#039;s not the purpose or aim of the paper in the first place. --[[User:Hesperus|Hesperus]] 20:31, 22 November 2010 (UTC) &lt;br /&gt;
&lt;br /&gt;
=Paper summary=&lt;br /&gt;
&lt;br /&gt;
==The idea/goal==&lt;br /&gt;
The paper provides a solution for nested virtualization on x86-based computers. Their approach is software-based, meaning that they&#039;re not altering the underlying architecture, and this is basically the most interesting thing about the paper, since x86 computers don&#039;t support nested virtualization in hardware. But apparently they were able to do it.&lt;br /&gt;
In addition, nested virtualization is generally not supported on x86 systems (the architecture is not designed with it in mind), but, for example, Windows 7 runs an XP VM under the covers when running XP programs, which shows the ability to run virtualization in parallel on a single hypervisor.&lt;br /&gt;
&lt;br /&gt;
The goal of nested virtualization and multiple host hypervisors comes down to efficiency. For example, virtualization on servers has been rapidly gaining popularity. The next evolutionary step is to extend single-level memory management virtualization support to handle nested virtualization, which is critical for &#039;&#039;high performance&#039;&#039;. [1]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==Related work==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Theory (Section 3.1)== &lt;br /&gt;
&lt;br /&gt;
Apparently, there are two models for implementing nested virtualization:&lt;br /&gt;
&lt;br /&gt;
* Multiple-level architecture support: every hypervisor handles the hypervisor running directly on top of it. For instance, suppose L0 (the host hypervisor) runs L1. If L1 attempts to run L2, then the trap handling and the work needed to allow L1 to instantiate a new VM are handled by L0. More generally, if L2 attempts to create its own VM, then L1 will do the trap handling and so on.&lt;br /&gt;
&lt;br /&gt;
* Single-level architecture support: This is the model supported by x86 machines. It is tied to the concept of &amp;quot;trap and emulate&amp;quot;, where every hypervisor emulates the underlying hardware (the VMX support in the paper&#039;s implementation) and presents a fake platform for the hypervisor running on top of it (the guest hypervisor) to operate on, letting it think that it&#039;s running on the actual hardware. The idea is that in order for a guest hypervisor to perform an operation requiring hardware-level privileges, it causes a fault or a trap; this trap is then caught by the main host hypervisor and inspected to see whether it&#039;s a legitimate and appropriate request. If it is, the host grants the privilege to the guest, again having it think that it&#039;s actually running on the bare-metal hardware.&lt;br /&gt;
&lt;br /&gt;
In this model, everything must go back to the main host hypervisor. The host hypervisor then forwards the trap and virtualization state to the level above that is responsible for it. For instance, suppose L0 runs L1, and L1 attempts to run L2. The command to run L2 goes down to L0, and L0 then forwards it back up to L1. This is the model we&#039;re interested in, because it is what x86 machines basically follow. Look at figure 1 in the paper for a better understanding of this.&lt;br /&gt;
&lt;br /&gt;
==Main contribution==&lt;br /&gt;
The paper proposes two newly developed techniques:&lt;br /&gt;
* Multi-dimensional paging (for memory virtualization)&lt;br /&gt;
* Multiple-level device management (for I/O virtualization)&lt;br /&gt;
&lt;br /&gt;
Other contributions:&lt;br /&gt;
* Micro-optimizations to improve performance.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Performance==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Critique==&lt;br /&gt;
&#039;&#039;&#039;The good:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* From what I have read so far, the research presented in the paper is probably the first to achieve efficient x86 nested virtualization without altering the hardware, relying on software-only techniques and mechanisms. The authors also won the Jay Lepreau best paper award.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;The bad:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
[1] http://www.haifux.org/lectures/225/ - &#039;&#039;&#039;Nested x86 Virtualization - Muli Ben-Yehuda&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==Background Concepts and Other Stuff==&lt;br /&gt;
The above looks good. I thought I&#039;d maybe start touching on some of the sections, so let me know what you guys think. Here&#039;s what I think would be useful to go over in the Background Concepts section:&lt;br /&gt;
&lt;br /&gt;
* Firstly, nested virtualization: why we use it (the paper gives the example of XP inside Win 7), and maybe going over the trap-and-emulate model of nested virtualization.&lt;br /&gt;
* Some of the terminology of nested virtualization: the difference between guest/host hypervisors (we&#039;re already familiar with guest/host OSs), the terminology of L0, ..., Ln with L0 being the bottom hypervisor, etc.&lt;br /&gt;
* x86 nested virtualization limitations: single-level architecture, guest/host mode, VMX instructions and how to emulate them. Some of this is in section 3.2 of the paper.&lt;br /&gt;
&lt;br /&gt;
Again, anything else you guys think we should add would be great.&lt;br /&gt;
&lt;br /&gt;
Commenting some more on the above summary: under the &amp;quot;main contributions&amp;quot; part, do you think we should count the nested VMX virtualization part as a contribution? If we have multiplexing memory and multiplexing I/O as main contributions, it would seem to make sense to have multiplexing the CPU as well, especially within the limitations of the x86 architecture, unless they are using someone else&#039;s technique for virtualizing these instructions.--[[User:Mbingham|Mbingham]] 21:16, 22 November 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=4972</id>
		<title>Talk:COMP 3000 Essay 2 2010 Question 9</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_2_2010_Question_9&amp;diff=4972"/>
		<updated>2010-11-15T15:08:17Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Group members:&lt;br /&gt;
&lt;br /&gt;
* Munther Hussain&lt;br /&gt;
* Jonathon Slonosky&lt;br /&gt;
* Michael Bingham&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
---------------&lt;br /&gt;
&lt;br /&gt;
Hey there, this is Munther. The prof said that we should be contacting each other to see who&#039;s still on board for the course. So please,&lt;br /&gt;
if you read this, add your name to the list of members above. You can find my contact info on my profile page by clicking my signature. We shall talk about the details and how we will approach this in the next few days. --[[User:Hesperus|Hesperus]] 16:41, 12 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
---------------------&lt;br /&gt;
&lt;br /&gt;
Checked in -- JSlonosky&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Pawel has already contacted us, so he&#039;s still in for the course; that makes 3 of us. The other three members, please drop in and add your names. We need to confirm the members today by 1:00 pm. --[[User:Hesperus|Hesperus]] 12:18, 15 November 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
----------------------&lt;br /&gt;
&lt;br /&gt;
Checked in --[[User:Mbingham|Mbingham]] 15:08, 15 November 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Lab_5_2010&amp;diff=4879</id>
		<title>COMP 3000 Lab 5 2010</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Lab_5_2010&amp;diff=4879"/>
		<updated>2010-11-09T15:33:12Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: added \n&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In this lab you will work with the commands &#039;&#039;&#039;ltrace&#039;&#039;&#039; and &#039;&#039;&#039;strace&#039;&#039;&#039;.  You may need to install these programs; if so, on Debian and Ubuntu systems type &#039;&#039;sudo aptitude install strace ltrace&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
==Questions==&lt;br /&gt;
&lt;br /&gt;
Answer all of the following questions.&lt;br /&gt;
&lt;br /&gt;
# Compile hello.c with the command &#039;&#039;gcc -O hello.c -o hello-dyn&#039;&#039;.  Run &#039;&#039;ltrace ./hello-dyn&#039;&#039;.  What dynamic functions did hello call?&lt;br /&gt;
# Run &#039;&#039;strace ./hello-dyn&#039;&#039;.  How many system calls did hello make?&lt;br /&gt;
# Re-compile hello.c using the command &#039;&#039;gcc -static -O hello.c -o hello-static&#039;&#039;.  Run hello-static using ltrace and strace.  How does the output compare with that from the previous two questions?  (Explain at a high level.)&lt;br /&gt;
# How big are the binaries of hello-dyn and hello-static? Why is one so much bigger than the other one?  Explain.&lt;br /&gt;
# Add a &#039;&#039;while(1) sleep(1);&#039;&#039; loop to hello.c so that it waits forever after saying hello.  Recompile statically and dynamically.  What is the resident and virtual memory used by both?&lt;br /&gt;
# Compile and run hello-fork.c.  Note that hello-fork.c produces a &#039;&#039;zombie&#039;&#039; process.  How do you fix hello-fork.c so that the zombie exits properly?&lt;br /&gt;
# How can you modify hello-fork.c so that the child process would run &#039;&#039;/bin/ls&#039;&#039; using the execve() function?&lt;br /&gt;
# strace and ltrace examine the behavior of a running program.  The entire design of UNIX is such that each process lives in its own address space.  How can these programs work?  What mechanism must they use?  &#039;&#039;&#039;Extra credit:&#039;&#039;&#039; what is that mechanism?  &#039;&#039;&#039;Extra extra credit:&#039;&#039;&#039; make a program that tells you the PIDs of all forked children of a monitored process.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
&lt;br /&gt;
 /* hello.c */&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
 &lt;br /&gt;
        printf(&amp;quot;Hello, world!\n&amp;quot;);&lt;br /&gt;
        return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 /* hello-fork.c */&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
 &lt;br /&gt;
        int pid;&lt;br /&gt;
 &lt;br /&gt;
        printf(&amp;quot;Hello, world!\n&amp;quot;);&lt;br /&gt;
 &lt;br /&gt;
        pid = fork();&lt;br /&gt;
 &lt;br /&gt;
        if (pid) {&lt;br /&gt;
                /* parent */&lt;br /&gt;
                while(1) {&lt;br /&gt;
                        sleep(1);&lt;br /&gt;
                }&lt;br /&gt;
        } else {&lt;br /&gt;
                /* child */&lt;br /&gt;
                printf(&amp;quot;I am the child!\n&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
        return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
==Hints==&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4425</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4425"/>
		<updated>2010-10-15T03:31:01Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Security */ It looks like i&amp;#039;ve made a lot of edits but it&amp;#039;s mostly minor stuff.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we are faced with growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot; (Dell, 2010)&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is the common mantra of storage administrators. The storage industry has been able to keep up with the increasing demand with matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have hardly changed since the 1950s. The dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, storage area networks (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks and would benefit greatly from improvements in storage technology. Specifically, improvements that provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions would be ideal.&lt;br /&gt;
&lt;br /&gt;
Object-Based Storage Devices (OSDs) address these issues by design. Objects consist of both data and metadata, carry a unique identifier, and are accessed with defined methods such as read and write. OSDs also handle the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and benefits in management and business intelligence from rich metadata, OSDs can be seen as a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length series of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. This means that the job of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left completely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot; (Bandulet, 2007)&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents the filesystem with &amp;quot;objects&amp;quot; which it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles the mapping onto physical storage. These objects also have metadata and access controls immediately associated with them. This allows the filesystem to work at a higher level of abstraction. This is important because the needs placed on filesystems have changed, and we will see as we compare object-based storage with block-based storage that the design of objects is more suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s. This means that the functionality of storage devices must also change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must be able to scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. Since the value of data has increased, it becomes more important to ensure the data&#039;s integrity and security. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design. Object-based storage is better suited to address these issues by design.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various types of virtualized systems that contain metadata to deal with device diversity or the remapping of blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
In a NAS system, a file system coordinates the interface between file blocks and the clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system are its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, passing all data through one point creates a bottleneck. Another issue is managing the metadata, which is shared among separate metadata servers remote from the hosts. Space allocation management on different storage system layers, and applications that individually add policy and management metadata, are spread throughout the system, so the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but provide a single system image of the file system. This means that a local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data. This eliminates the NAS bottleneck. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, it could be assumed that all the clients could be trusted, so security was not a primary concern. This lack of security concern is one of the main reasons that block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management a much more manageable task, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and scalability with metadata. Each object comes with a set of access rules given to it by the management server, and metadata is associated and stored directly with each data object, carried automatically between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded, reducing server overhead and processing, and allows for larger clusters of storage compared with traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs, by contrast, have integrity mechanisms that operate quite differently from those of block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected; the erroneous block may not even contain any data. Such errors usually happen during a backup procedure or when a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only worries about managing objects. This is done by managing metadata as well as maintaining internal copies of that metadata. Hence, an OSD has knowledge of its object layout even when one or more groups of objects reside on different OSDs. In this way, OSDs know what space is used or unused and can scan for and correct errors without losing data. In the event of a failure while recovering one or more files, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout, however, enables it to recover data specific to a byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated uniquely from the contents of the file. The file can thus be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security is an issue that must be confronted in all modern storage networks. Security issues come in a wide variety of forms, so they can be difficult to deal with. Both SAN and NAS have a variety of ways of handling security, but an object-based approach can make security measures more effective and easier to manage.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; For the sake of security, running a SAN on Fibre Channel helps isolate its network, as it does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it&#039;s up to the network infrastructure and host system to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures in SAN systems. Zoning allocates a certain amount of storage to clients; zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but they differ in the type of device involved: switches implement zoning, while disk array controllers perform LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas, and the proprietary operating system a NAS runs has access control configurations, much like those of other traditional OSs, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs gives them a fair amount of flexibility in controlling access. Clients access an OSD by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent a wide range of potential attacks, which gives OSD systems an advantage over block-based systems.&lt;br /&gt;
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system based around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Since Ceph is based on OSDs, it takes advantage of the ability for clients to interact directly with the devices, which avoids the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Since objects carry security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. However, challenges to its adoption in industry remain. One is that OSD is currently needed only in high-end business solutions, preventing it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, we will likely see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that changes do need to occur as storage grows and finer levels of management are needed for data storage. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better tools for managing data using the rich metadata of objects, combined with the security and data transfer speeds of NAS and SAN and with integrity controls for backups and redundancy, will make object storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J. Satran, 2005. The OSD Security Protocol. [online] Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia. [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4401</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4401"/>
		<updated>2010-10-15T03:13:17Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Introduction */ Just minor language editing&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot; (Dell, 2010)&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has kept up with this demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have provided mostly the same functionality since the 1950s. The dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in storage technology. Specifically, improvements that provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions would be ideal.&lt;br /&gt;
&lt;br /&gt;
Object-Based Storage Devices (OSD) address these issues by design. Object storage uses objects that consist of data plus metadata describing the object. Objects are accessed through defined methods such as read and write and carry a unique ID. The device also handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
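To make the object model concrete, here is a minimal illustrative sketch in Python (not from any cited source; the class and method names are hypothetical) of an object that bundles data, descriptive metadata, a unique ID, and read/write methods:

```python
import uuid


class StorageObject:
    """Illustrative model of an OSD object: data, metadata, and a unique ID."""

    def __init__(self, data=b"", metadata=None):
        self.object_id = uuid.uuid4().hex     # unique ID carried by every object
        self.data = bytearray(data)
        self.metadata = dict(metadata or {})  # rich, per-object metadata

    def read(self, offset=0, length=None):
        end = len(self.data) if length is None else offset + length
        return bytes(self.data[offset:end])

    def write(self, payload, offset=0):
        end = offset + len(payload)
        if end > len(self.data):
            # the device, not the filesystem, grows the object as needed
            self.data.extend(b"\x00" * (end - len(self.data)))
        self.data[offset:end] = payload
```

A client would work purely in terms of such objects, for example `obj = StorageObject(b"hello", {"owner": "alice"})` followed by `obj.write(b" world", offset=5)`, never seeing raw blocks.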
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and management and business intelligence benefits from rich metadata, OSD is a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. Consequently, the task of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a write to a block on the disk. In this way, the scope of a filesystem extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
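The translation burden this places on the filesystem can be sketched as follows: an illustrative Python model assuming a hypothetical 512-byte block device, where a byte-level file write must be turned into whole-block read-modify-write operations. None of these names come from a real driver API.

```python
BLOCK_SIZE = 512  # fixed-length blocks, as on a traditional disk

disk = {}  # block number to raw bytes; stands in for the physical device


def read_block(n):
    return disk.get(n, bytes(BLOCK_SIZE))


def write_block(n, block):
    assert len(block) == BLOCK_SIZE  # the device only accepts whole blocks
    disk[n] = bytes(block)


def file_write(file_start_block, offset, payload):
    """The filesystem's job: turn a byte-level file write into block I/O."""
    while payload:
        n = file_start_block + offset // BLOCK_SIZE
        within = offset % BLOCK_SIZE
        chunk = payload[:BLOCK_SIZE - within]
        block = bytearray(read_block(n))           # read-modify-write: the disk
        block[within:within + len(chunk)] = chunk  # knows nothing about files
        write_block(n, bytes(block))
        offset += len(chunk)
        payload = payload[len(chunk):]
```

For instance, `file_write(0, 510, b"spanning")` forces the filesystem to touch two blocks, because the write straddles the boundary between block 0 and block 1.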
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot; (Bandulet, 2007)&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents the filesystem with &amp;quot;objects&amp;quot; that it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto the physical medium. These objects also have metadata and access controls directly associated with them. This allows the filesystem to work at a higher level of abstraction. This matters because the needs placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
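A minimal sketch of what such an object interface might look like, loosely in the spirit of (but not implementing) the T10 OSD command set; all class, method, and field names here are invented for illustration:

```python
class ObjectStoreDevice:
    """Sketch of the object-level interface a filesystem would see on an OSD."""

    def __init__(self):
        self._objects = {}  # object_id: {"data", "meta", "acl"}
        self._next_id = 0

    def create(self, metadata=None, readers=()):
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = {
            "data": bytearray(),
            "meta": dict(metadata or {}),
            "acl": set(readers),  # access control lives with the object
        }
        return oid

    def write(self, oid, payload, offset=0):
        data = self._objects[oid]["data"]
        end = offset + len(payload)
        if end > len(data):
            data.extend(b"\x00" * (end - len(data)))
        data[offset:end] = payload  # the device, not the filesystem, places bytes

    def read(self, oid, client):
        entry = self._objects[oid]
        if client not in entry["acl"]:
            raise PermissionError("client not authorized for this object")
        return bytes(entry["data"])

    def delete(self, oid):
        del self._objects[oid]
```

Note how the filesystem never sees a block: it names an object, and the device resolves placement, growth, and access control internally.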
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s. The functionality of storage devices must change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring its integrity and security has become more important. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design; object-based storage is designed to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while maintaining data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata, and various virtualized systems add metadata of their own to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system lie in its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, routing all data through one point creates a bottleneck. Another issue is managing the metadata: it is shared among separate metadata servers remote from the hosts, while space-allocation management on different storage layers and applications that individually add policy and management metadata are spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image. This means that a local user need not be concerned with where data is physically stored, since a level of abstraction separates the user from the data&#039;s physical location. This eliminates the NAS bottleneck. In the past, SANs were implemented on private Fibre Channel networks designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern. That lack of security concern is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, together with the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management far more tractable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and more scalable metadata. Each object carries a set of access rules given to it by the management server, and its metadata is stored directly with the object and automatically carried between layers and across devices. Space-allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces support.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity and limits the scalability of using file systems for archiving. OSDs, by contrast, have integrity mechanisms that operate quite differently from those of block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that when a block develops an error, it is almost impossible to determine what part of the file system is affected; the faulty block may not even contain any data. Such errors typically surface during a backup procedure or when a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. The file system manager no longer cares what kind of disk drive is being used; it only manages objects. This is achieved by managing metadata and maintaining internal copies of that metadata. Hence, an OSD knows its object layout even when groups of objects span multiple OSDs. In this way OSDs know which space is used or unused and can scan for and correct errors without losing data. When recovery of a file or a number of files fails, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout, however, enables it to recover data down to a specific byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated from the object&#039;s contents. The object can thus be verified for accuracy, to ensure the contents remain the same, and for integrity, to ensure the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
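The idea of a content-derived hash key can be illustrated with an ordinary cryptographic digest. The sources do not specify an algorithm, so SHA-256 is used here purely as a stand-in, and the function names are hypothetical:

```python
import hashlib


def content_key(data: bytes) -> str:
    """Digest derived from an object's contents, usable as its integrity key."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_key: str) -> bool:
    """Integrity check: the stored contents must still match their key."""
    return content_key(data) == expected_key


def find_duplicates(objects: dict) -> dict:
    """Identical contents hash to the identical key, so a store can flag
    duplicate objects by comparing keys alone."""
    seen, dups = {}, {}
    for oid, data in objects.items():
        key = content_key(data)
        if key in seen:
            dups.setdefault(seen[key], []).append(oid)
        else:
            seen[key] = oid
    return dups
```

Any single-bit corruption of an object changes its digest, so a background scan comparing stored keys against recomputed ones detects silent damage without consulting the filesystem.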
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SAN has traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; From a security standpoint, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and the host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures in SAN systems. Zoning allocates a certain amount of storage to clients; zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but the two differ in the type of device involved: switches implement zoning, while disk array controllers implement LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
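LUN masking can be pictured as an allow-list kept by the array controller: each initiator (host) sees only the logical units it has been granted. The following is an illustrative sketch, not a real controller API; the initiator names and LUN numbers are made up:

```python
class ArrayController:
    """Sketch of LUN masking: the controller filters which logical units
    each initiator is allowed to see and address."""

    def __init__(self):
        self._masks = {}  # initiator name: set of visible LUNs

    def grant(self, initiator, lun):
        self._masks.setdefault(initiator, set()).add(lun)

    def visible_luns(self, initiator):
        # a host discovering storage sees only its unmasked LUNs
        return sorted(self._masks.get(initiator, set()))

    def handle_io(self, initiator, lun):
        if lun not in self._masks.get(initiator, set()):
            raise PermissionError(f"LUN {lun} is masked for {initiator}")
        return "ok"
```

The key point the essay makes holds in the sketch: the mask is enforced per host, not per object, so once a host can see a LUN it can touch every block on it.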
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, a NAS system is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas, and the proprietary operating system a NAS runs has access control configurations, much like traditional OSs, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSD enables it to cover the four quadrants of security threats outlined above. Clients access an OSD by presenting &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) identifying the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
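The capability idea can be sketched with an HMAC computed over the (OSD name, partition ID, object ID) tuple: a security manager signs the tuple, and the device itself checks the signature before serving a request. This is only an illustration of the concept, not the actual OSD security protocol; the shared secret and all function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"shared-key-between-manager-and-osd"  # hypothetical shared secret


def issue_capability(osd_name, partition_id, object_id):
    """Security manager signs the (OSD name, partition ID, object ID) tuple."""
    msg = f"{osd_name}/{partition_id}/{object_id}".encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (osd_name, partition_id, object_id, tag)


def osd_check(capability, osd_name, partition_id, object_id):
    """The OSD itself verifies the credential before serving the request."""
    cap_osd, cap_part, cap_obj, tag = capability
    if (cap_osd, cap_part, cap_obj) != (osd_name, partition_id, object_id):
        return False  # credential names a different object
    msg = f"{osd_name}/{partition_id}/{object_id}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Because the check happens on the device, a client holding a capability for one object gains no access to any other, which is exactly the per-object granularity block-level masking cannot offer.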
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients can interact directly with the storage devices, avoiding the performance bottlenecks traditionally caused by SAN controllers or NAS heads. This direct access lets Ceph support a very large number of clients concurrently accessing data on the system. Because objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. However, challenges to its adoption in industry remain. One is that OSD is currently needed only in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, adoption should increase.&lt;br /&gt;
&lt;br /&gt;
Changes clearly do need to occur as storage grows and finer-grained management of stored data becomes necessary. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better tools for managing data through the rich metadata of objects, combined with the security and data transfer speeds of NAS and SAN and with integrity controls for backups and redundancy, will make object storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J. Satran, 2005. The OSD Security Protocol. [online] Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia. [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4378</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4378"/>
		<updated>2010-10-15T02:57:16Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Introduction */ This seems to be a direct quote from the paper, so needs author and date&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot; (Dell, 2010)&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has kept up with this demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained largely unchanged since the 1950s. The dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in storage technology. Improvements that provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions would be ideal.&lt;br /&gt;
&lt;br /&gt;
Object-Based Storage Devices (OSD) address these issues by design. Object storage uses objects that consist of data plus metadata describing the object. Objects are accessed through defined methods such as read and write and carry a unique ID. The device handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and management and business intelligence benefits from rich metadata, OSD is a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. Consequently, the task of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a write to a block on the disk. In this way, the scope of a filesystem extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot; (Bandulet, 2007)&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents the filesystem with &amp;quot;objects&amp;quot; that it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto the physical medium. These objects also have metadata and access controls directly associated with them. This allows the filesystem to work at a higher level of abstraction. This matters because the needs placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s. The functionality of storage devices must change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring its integrity and security has become more important. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design; object-based storage is designed to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while maintaining data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata, and various virtualized systems add metadata of their own to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS provides a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS approach lie in its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, passing all the data through one point causes a bottleneck. Another issue is managing the metadata, which is shared among separate metadata servers remote from the hosts. Space allocation management on different storage system layers, and applications that individually add policy and management metadata, are spread throughout the system, so the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but provide a single system image of the file system. A local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data. This eliminates the bottleneck of NAS. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, it could be assumed that all the clients could be trusted, so security was not a primary concern. This lack of security concern is one of the main reasons that block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management a much more manageable task, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and better scalability of metadata. Each object comes with a set of access rules given to it by the management server. Metadata is associated and stored directly with each data object and is automatically carried between layers and across devices, while space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded, reducing server overhead and processing, and allows for larger clusters of storage than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems used in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits their scalability. OSDs have integrity mechanisms that operate quite differently from those of block store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected; the affected block may not even contain any data. Such errors usually occur during a backup procedure or when a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only worries about managing objects. This is done by managing metadata as well as maintaining internal copies of that metadata. Hence, an OSD has knowledge of its object layout even when one or more groups of objects reside on different OSDs. In this way OSDs know what space is used or unused and can scan for and correct errors without losing data. When recovery of a file or a number of files fails, traditional systems may have to do a complete file system restore. An OSD&#039;s awareness of its object layout, however, enables it to recover data specific to a byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature. Each object has an associated hash key that is generated from the contents of the object, so the object can be verified both for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
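&lt;br /&gt;
As a rough illustration of how such a hash key could work (assuming SHA-256 here; the essay does not name a specific algorithm), the key is derived purely from the object&#039;s contents, so the same mechanism serves both integrity verification and duplicate detection:&lt;br /&gt;
&lt;br /&gt;
```python
import hashlib

def content_key(data):
    # Hash key derived solely from the object's contents (SHA-256 is an
    # assumption for this sketch; real devices may use other algorithms).
    return hashlib.sha256(data).hexdigest()

def verify(data, stored_key):
    # Integrity check: recompute the key and compare with the stored one.
    return content_key(data) == stored_key

def find_duplicates(objects):
    # Management use: identical contents produce identical keys,
    # so duplicates can be flagged without byte-by-byte comparison.
    seen = {}
    dupes = []
    for oid, data in objects.items():
        key = content_key(data)
        if key in seen:
            dupes.append((seen[key], oid))
        else:
            seen[key] = oid
    return dupes
```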
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel. &amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; From a security standpoint, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; these zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but the two differ in the type of device involved: switches implement zoning, while disk array controllers implement LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
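&lt;br /&gt;
A minimal sketch of the LUN masking idea (the hostnames and table below are hypothetical): the disk array controller consults a masking table and exposes to each host only the LUNs that host is permitted to see.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical masking table: which LUNs each host is allowed to see.
MASKING_TABLE = {
    "host-a": {0, 1},
    "host-b": {2},
}

def visible_luns(host, all_luns):
    # The controller filters the full LUN list down to the permitted set;
    # an unknown host sees nothing.
    allowed = MASKING_TABLE.get(host, set())
    return sorted(lun for lun in all_luns if lun in allowed)
```
&lt;br /&gt;
Zoning achieves a similar effect but is enforced in the switches rather than in the disk array controller.&lt;br /&gt;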
&lt;br /&gt;
NAS has its own vulnerabilities, and as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas. The proprietary operating system a NAS runs has access control configurations, much like traditional OSs, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD device by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
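&lt;br /&gt;
One common way such a capability can be constructed is as a message authentication code over the identifying tuple; this is only a sketch of the general idea, not the actual OSD security protocol (see footnote 9). The management server and the device share a secret, so the device can validate credentials itself, without contacting the server on each request:&lt;br /&gt;
&lt;br /&gt;
```python
import hashlib
import hmac

# Hypothetical secret shared by the management server and the OSD device.
SECRET = b"shared-by-manager-and-device"

def issue_capability(osd_name, partition_id, object_id):
    # Management server signs the tuple that identifies the object.
    msg = "{}/{}/{}".format(osd_name, partition_id, object_id).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def device_check(osd_name, partition_id, object_id, capability):
    # The OSD itself validates the credential before serving the request,
    # using a constant-time comparison.
    msg = "{}/{}/{}".format(osd_name, partition_id, object_id).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, capability)
```
&lt;br /&gt;
A capability issued for one object is useless for any other, which is what lets the device enforce per-object access control directly.&lt;br /&gt;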
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real world networked storage system based around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Since Ceph is based on OSDs, it takes advantage of the ability for clients to interact directly with the devices, which avoids the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. But there remain challenges to its adoption in industry. One is that it is currently only needed in high-end business solutions, preventing it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; But as newer features are added and the standards mature, we will see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that changes do need to occur as storage grows and finer levels of management are needed for data storage. Object-based storage has evolved to fit these needs where block-based storage has stagnated. With better tools for managing data through the rich metadata of objects, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy, object-based storage will be an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at : &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J.Satran, 2005. The OSD Security Protocol. [online] Available at &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long,&lt;br /&gt;
and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia  [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_11&amp;diff=4352</id>
		<title>Talk:COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=Talk:COMP_3000_Essay_1_2010_Question_11&amp;diff=4352"/>
		<updated>2010-10-15T02:39:49Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Last minute changes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Last minute changes ==&lt;br /&gt;
Ok guys, so its due early tomorrow. We have the essay pretty much completed aside from a few things. &lt;br /&gt;
&lt;br /&gt;
First. Are we getting rid of the headings? Other groups have them in at the moment, I know the prof said the essay should read as if they weren&#039;t there but it might not hurt for them to be there.&lt;br /&gt;
&lt;br /&gt;
Second. The essay needs to flow better. Some intro and outro sentences acknowledging the next section and refering to the previous ones would be nice.&lt;br /&gt;
&lt;br /&gt;
Otherwise, what else remains?&lt;br /&gt;
--[[User:Smcilroy|Smcilroy]] 23:12, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I&#039;m trying to cleanup the references, is this format acceptable? --[[User:Dagar|Dagar]] 23:45, 14 October 2010 (UTC)&lt;br /&gt;
: Yes, that looks alot better --[[User:Smcilroy|Smcilroy]] 00:34, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::I think we can keep some of the main headings, but I don&#039;t think we need them all. I think the real meat of the essay is in the comparisons with networked storage like NAS and especially SAN, so those sections should probably have headings of some kind. I also agree on the flow needing some work, some of the sections have a bit of overlap.&lt;br /&gt;
&lt;br /&gt;
::Anil had mentioned to me today an example of a networked file system based on object store devices - [http://ceph.newdream.net/about/ Ceph]. [http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/ here is the full paper] on the system. I was thinking it might be worth it to mention it at least, maybe even have a small section about it, just so we get in a real world example of this technology. What do you guys think?&lt;br /&gt;
&lt;br /&gt;
::--[[User:Mbingham|Mbingham]] 01:56, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::Heres a quick example section, I know this is pretty last minute but what do you guys think?&lt;br /&gt;
&lt;br /&gt;
::Ceph is an example of a real world networked storage system based around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions. (insert reference to paper) Since Ceph is based on OSDs, it takes advantage of the ability for clients to interact directly with the devices, which avoids the traditional bottlenecks to performance caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Since objects have security controls it can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
::--[[User:Mbingham|Mbingham]] 02:09, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::Also (sorry for all the comments), where does the first sentence of the Security section come from? It sounds like something that should be referenced, and seems kind of out of place because I don&#039;t think those four &amp;quot;quadrants&amp;quot; are brought up again?&lt;br /&gt;
&lt;br /&gt;
::--[[User:Mbingham|Mbingham]] 02:11, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::: Ok if Anil mentioned it, it&#039;s probably a good idea to include it, maybe after the 3 comparisons. I got an email back from Anil and he said that headings are OK as long as they add to the essay. So I think we can leave them in. --[[User:Smcilroy|Smcilroy]] 02:30, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::::Cool, I added the section in. --[[User:Mbingham|Mbingham]] 02:39, 15 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Tightening up the Intro ==&lt;br /&gt;
Hey everyone,&lt;br /&gt;
&lt;br /&gt;
I think it might be useful to re-work the intro a bit so that it better represents the direction the essay has taken since then. Heres a quick mockup of a reworked intro. It could be expanded on in some parts and worked on, etc. I would like any comments, if you guys think this better represents the essay, or what you think needs changing in the introduction. Here it is:&lt;br /&gt;
&lt;br /&gt;
:Storage needs have evolved over the past 60 years, and as a result the functionality expected from filesystems and storage solutions has evolved as well. The low level interface that a storage device implements, however, has remained mostly the same. A block based interface is still the most common mechanism for accessing storage devices. Recently, however, especially with the growth of networked storage architectures such as NAS and SAN, this interface needs to be reworked to accommodate changing needs. Object based storage is increasingly becoming an attractive alternative to block based storage. The design of object based storage devices (OSD), which store objects rather than blocks, easily associates data with meta-data. Objects are created, destroyed, read from, and written to, as well as carrying a unique ID. The device itself manages the physical space and can handle security on a per-object level. A storage network which is based on OSDs can provide better scalability without bottlenecks, better security with per-object access controls, and better integrity with unique hash keys. In this way, the OSD interface is looking increasingly attractive as a building block for filesystems, especially in the context of networked storage.&lt;br /&gt;
&lt;br /&gt;
I think the main thing is that it brings up networked storage earlier and puts a bit more focus on it. I think the main arguments for object based storage is its applicability to large storage networks, and the advantages it has over block based architectures. For this reason I think the intro should put a bit more focus on it. Does that make sense? Any comments or suggestions you guys have are welcome.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 21:18, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:I know what you mean, putting a focus on network storage is a good idea. Let me see if I can add your suggestions to the intro and maybe combine the two.--[[User:Smcilroy|Smcilroy]] 23:12, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Wikipedia Sources ==&lt;br /&gt;
I think we may want to replace the references to wikipedia with something more authoritative. [http://www.redbooks.ibm.com/abstracts/sg245470.html?Open this massive pdf] from IBM supports the idea that fiber channels are the dominant infrastructure of SANs, but i&#039;m not sure if it mentions how that is changing.&lt;br /&gt;
&lt;br /&gt;
The wikipedia page for LUN masking has [http://www.sansecurity.com/san-security-faq.shtml this] as its reference for the definitions, there&#039;s also [http://technet.microsoft.com/en-us/library/cc758640(WS.10).aspx this] microsoft article and [http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf this] paper from Hitachi. I&#039;m not sure which of these is most relevant since I just did a quick google search and haven&#039;t really read up on LUN masking or zoning, so someone else would probably be better suited to decide which one if any to use.&lt;br /&gt;
&lt;br /&gt;
How does that sound to everyone?&lt;br /&gt;
&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 02:55, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:I agree, the Wikipedia references need to go. Whoever included those references should be able to find alternate sources from the one&#039;s you gave. --[[User:Smcilroy|Smcilroy]] 17:45, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Some Sourcing Issues and Other Stuff ==&lt;br /&gt;
Just a reminder, if we&#039;re taking direct quotes from a source they need to be in quotation marks and attributed with the authors name and the date (I think) in parenthesis at the end, not just a link or footnote reference. There was an issue with this in the first couple sentences of the scalability section. I&#039;ve put it in quotes (though I didn&#039;t see any authors listed so I just put the company), but I think that that information might be better worked into the &amp;quot;Changing Storage Needs&amp;quot; section, what do you guys think?&lt;br /&gt;
&lt;br /&gt;
Also, I think probably sometime today we should divide the rest of the sections up and try to get most of the content in so we have tomorrow for editing and combining the information so that it flows well. Again, any thoughts?&lt;br /&gt;
&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 19:32, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
: Sorry about the citation issue, you&#039;re right. I used the quote to emphasize the fact that scalability issues are evident in disk block systems. But now that I read it, it doesn&#039;t really transition well into the second paragraph. I don&#039;t mind if you move the quote to another section. Other than that, I could just finish up the section about Security. I don&#039;t really know who else is actively contributing to this essay though...or at least don&#039;t see anyone volunteering to take a topic other than Mbingham, Smcilroy and myself...&lt;br /&gt;
:--[[User:Myagi|Myagi]] 15:47, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:No problem, it&#039;s just something to watch out for. I&#039;ll integrate it with the other section. &lt;br /&gt;
:Dagar has been making edits to the essay as well, he&#039;s cleaned up the language in some of the sections and organized the references. Maybe he would like to tackle one of the object specific sections?&lt;br /&gt;
:--[[User:Mbingham|Mbingham]] 20:02, 12 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::I apologize for the delay, this has been an easy thing to neglect during a busy week. What&#039;s the proper way to reference with this wiki? --[[User:Dagar|Dagar]] 21:29, 13 October 2010 (UTC) &lt;br /&gt;
&lt;br /&gt;
:::check out this reference guide, it explain how to reference any material you find online. [http://libweb.anglia.ac.uk/referencing/harvard.htm Harvard System of Reference] --[[User:Smcilroy|Smcilroy]] 22:46, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I&#039;m going to finish up the Security section if nobody tags it by the end of today. I have a draft written up. The fact that more people aren&#039;t tagging the document outline and volunteering responsibilities is kind of unnerving...&lt;br /&gt;
&lt;br /&gt;
--[[User:Myagi|Myagi]] 07:57, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I&#039;m going to expand the scalability and integrity sections. Then once the security section is done, I think that just leaves the section on the OSD standard and future plans for the tech. Then in the conclusion we can recap.&lt;br /&gt;
--[[User:Smcilroy|Smcilroy]] 22:54, 13 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:Sounds like a plan. I&#039;ll clean up/expand what I have written and get started with some initial stuff for the object sections. Anyone else is welcome to expand and edit as well.&lt;br /&gt;
:--[[User:Mbingham|Mbingham]] 00:44, 14 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Essay Format and Assigned Tasks ==&lt;br /&gt;
So I added an intro and I did it like it was an essay and not a wiki article. Feel free to edit, expand and replace it as you see fit.&lt;br /&gt;
Also I think we should just list the topics we want to talk about and then people can put their name beside it and work on it, that way we don&#039;t have two people working on the same thing. Then we can edit it all so it fits together in the end. What do you think?&lt;br /&gt;
--[[User:Smcilroy|Smcilroy]] 15:16, 10 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:Sounds like a good idea. Here&#039;s a relatively quick list of topics to talk about, based on our discussions and the outline below. Add in any sections anyone thinks are missing and put your name beside areas you want:&lt;br /&gt;
&lt;br /&gt;
:*Overview and history of block-based storage -Mbingham  (I added a useful diagram here -Npradhan)&lt;br /&gt;
:*Block based storage standards - SCSI, SATA, ATA/IDE etc -Mbingham&lt;br /&gt;
:*Networked storage architectures: SAN and NAS -Smcilroy&lt;br /&gt;
&lt;br /&gt;
:*How storage needs have changed since the development of block-based storage -Npradhan&lt;br /&gt;
:(maybe focus on the Internet, massive coorporate/government networks, large personal storage, etc)&lt;br /&gt;
&lt;br /&gt;
:*Overview and History of object-based storage -Npradhan&lt;br /&gt;
:*Object-based storage standards (ANSI OSD specification)&lt;br /&gt;
:*Object-based storage applied to networked storage -dagar&lt;br /&gt;
&lt;br /&gt;
:Comparison of object and block based stores focusing on:&lt;br /&gt;
::*Scalability -Myagi&lt;br /&gt;
::*Integrity -Myagi&lt;br /&gt;
::*Security -Myagi&lt;br /&gt;
&lt;br /&gt;
:*Conclusion -Smcilroy&lt;br /&gt;
&lt;br /&gt;
:Also, it would probably be useful for people to be reading over each other&#039;s work and making suggestions, etc. I would also be cool with other people adding stuff to my sections if they have additional info or if there&#039;s something i&#039;ve overlooked. There&#039;s 11 or 12 sections there, and I think there&#039;s six of us, so we can start off taking maybe 2 sections each, and then if we don&#039;t have all the sections covered we can divide them up later. How does that sound?&lt;br /&gt;
:--[[User:Mbingham|Mbingham]] 16:45, 10 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:Good plan, I took Scalability and Integrity comparisons of object and block stores.&lt;br /&gt;
:--[[User:Myagi|Myagi]] 13:26, 10 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Initial Outline ==&lt;br /&gt;
&#039;&#039;&#039;Introduction&#039;&#039;&#039;&lt;br /&gt;
* Thesis Statement: Object stores are becoming more attractive because the demands on filesystems has changed and the block store interface has not been updated to accommodate these changes.&lt;br /&gt;
* What will be discussed&lt;br /&gt;
 - Current state of block based storage&lt;br /&gt;
 - Brief overview of object store&lt;br /&gt;
 - Scalability&lt;br /&gt;
 - Integrity&lt;br /&gt;
 - Security&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Block based storage&#039;&#039;&#039;&lt;br /&gt;
* NAS is a single storage device that is shared on a LAN&lt;br /&gt;
 - File level/Single storage device(s) that operates individually&lt;br /&gt;
 - Clients connect to the NAS head (interface between client and NAS) rather than to the individual storage devices&lt;br /&gt;
 - Use small, specialized and proprietary operating systems instead of general purpose OSs&lt;br /&gt;
 - Can enforce security constraints, quotas, indexing&lt;br /&gt;
 - Example of access: \\NAS\Sharename&lt;br /&gt;
&lt;br /&gt;
Advantages&lt;br /&gt;
 - Dedicated, feature-rich file sharing&lt;br /&gt;
 - Network optimized&lt;br /&gt;
 - Centralized storage&lt;br /&gt;
 - Less administration overhead&lt;br /&gt;
Disadvantages&lt;br /&gt;
 - Metadata processing has to be handled on the NAS server&lt;br /&gt;
 - Scaling up with more storage behind the NAS head is restricted because metadata processing on the NAS device becomes a bottleneck&lt;br /&gt;
 - Scaling by adding additional NAS devices quickly becomes a management issue because data is isolated on individual NAS islands&lt;br /&gt;
 - High latency protocols that clog LANs, using TCP/IP &lt;br /&gt;
 - Not suitable for data transfer intensive apps &lt;br /&gt;
&lt;br /&gt;
* SAN filesystem is a local network of multiple devices that operate on disk blocks and provides a file system abstraction&lt;br /&gt;
 - Block level/local network of multiple device&lt;br /&gt;
 - Every client computer has its own file system&lt;br /&gt;
 - A SAN alone does not provide the file abstraction but there is a file system built on top of SANs&lt;br /&gt;
 - Example of access: D:\, E:\, etc.&lt;br /&gt;
&lt;br /&gt;
Advantages&lt;br /&gt;
 - High-performance shared disk&lt;br /&gt;
 - Scalable&lt;br /&gt;
 - Short I/O paths&lt;br /&gt;
 - Lots of parallelism&lt;br /&gt;
Disadvantages&lt;br /&gt;
 - Harder to maintain, lots of file systems to manage&lt;br /&gt;
 - Harder to administer, lots of storage access rights to coordinate&lt;br /&gt;
&lt;br /&gt;
* OSDs closes the gap between the scalability of SAN and the file sharing capabilities of NAS&lt;br /&gt;
* Block storage has limitations that have become more apparent as demand for scalability and security has grown&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview of OSD&#039;&#039;&#039;&lt;br /&gt;
* An OSD device deals in objects&lt;br /&gt;
 - Handles the mapping from object to physical media locations itself&lt;br /&gt;
 - Tracks metadata as attributes, such as creation timestamps, allowing for easier sharing of data among clients&lt;br /&gt;
 - OSDs are directly connected to clients without the need for an intermediary to handle metadata.&lt;br /&gt;
&lt;br /&gt;
* ANSI ratified version 1.0 of the OSD specification in 2004, defining a protocol for communication with object-based storage devices&lt;br /&gt;
* The OSD specification describes:&lt;br /&gt;
 - a SCSI command set that provides a high-level interface to OSD devices&lt;br /&gt;
 - how file systems and databases store and retrieve data objects&lt;br /&gt;
 - work has continued in ratifying the OSD-2 and OSD-3 specifications&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Scalability&#039;&#039;&#039;&lt;br /&gt;
* Metadata is associated and stored directly with data objects and carried between layers and across devices&lt;br /&gt;
* Space allocation delegated to storage device&lt;br /&gt;
* Server has reduced overhead and processing, allowing larger clusters of storage&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Integrity&#039;&#039;&#039;&lt;br /&gt;
* OSD&#039;s have knowledge of its object layout&lt;br /&gt;
* Unlike block stores, OSD&#039;s can recover data specific to a byte range&lt;br /&gt;
 - OSD&#039;s know what space is being unused in this way&lt;br /&gt;
 - Can scan and correct errors without losing data&lt;br /&gt;
* OSD&#039;s maintain internal copies of metadata&lt;br /&gt;
 - User doesn&#039;t have to do a complete file system restore for the sake of one or few unrecoverable files&lt;br /&gt;
 - OSD&#039;s can identify the byte range lost and restore the file efficiently&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Security&#039;&#039;&#039;&lt;br /&gt;
* Suited for network based storage&lt;br /&gt;
* Associate security attributes directly with data object&lt;br /&gt;
* Security requests handled directly by storage device &lt;br /&gt;
* Computer system can access OSD device by providing cryptographically secure credentials(capability) that the OSD device can validate&lt;br /&gt;
 - This can prevent malicious access from unauthorized requests or accidental access from misconfigured machines&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Conclusion&#039;&#039;&#039;&lt;br /&gt;
* Reiteration of thesis statement&lt;br /&gt;
&lt;br /&gt;
--[[User:Myagi|Myagi]] 18:15, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hey Myagi, I thought I&#039;d move your outline to its own section at the top of the page so it&#039;s more visible. I hope you don&#039;t mind. If you do, feel free to revert this edit.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 02:31, 8 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
: It&#039;s all good.&lt;br /&gt;
:--[[User:Myagi|Myagi]] 10:00, 8 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:This outline looks pretty good to me. I like the three focus points of scalability, integrity and security; those seem to be constant themes in what I&#039;ve read about object stores. &lt;br /&gt;
&lt;br /&gt;
:For the block storage overview, the two current standards for a block-based interface seem to be SCSI and SATA. SCSI seems to be used more in enterprise storage and SATA more in personal storage (someone correct me if I&#039;m wrong here). We might also want to take a look at SAN and NAS. I need to do some more reading, haha.&lt;br /&gt;
&lt;br /&gt;
:Also, I think we might as well start putting up some stuff on the article page. Even just a few sentences per section. I can start on that tomorrow or maybe Saturday. Of course any one else is welcome to as well.&lt;br /&gt;
&lt;br /&gt;
:--[[User:Mbingham|Mbingham]] 02:31, 8 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Quick Overview ==&lt;br /&gt;
So I hope I&#039;m not the only one who was wondering &amp;quot;What are object stores?&amp;quot; when reading the question. I don&#039;t think the textbook mentions it, but I didn&#039;t read through the filesystems chapter very thoroughly. Here&#039;s where some quick googling has got me:&lt;br /&gt;
&lt;br /&gt;
Most storage devices divide their storage up into blocks, a fixed-length sequence of bytes. The interface that storage devices provide to the rest of the system is pretty simple. It&#039;s essentially &amp;quot;Here, you can read from or write to blocks, have fun&amp;quot;. This is block-based storage.&lt;br /&gt;
&lt;br /&gt;
Object-based storage is different. The interface it presents to the rest of the system is more sophisticated. Instead of directly accessing blocks on the disk, the system accesses objects. Objects are like a level of abstraction on top of blocks. Objects can be variable sized, read/written to, created, and deleted. The device itself handles mapping these objects to blocks and all the issues that come with that, rather than the OS.&lt;br /&gt;
&lt;br /&gt;
Here&#039;s some papers that give an overview of object-based storage:&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1612479 Object Storage: The Future Building Block for Storage Systems]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1222722 Object-Based Storage]&lt;br /&gt;
&lt;br /&gt;
I think if you just look those up on google scholar you can access the pdf without even being inside carleton&#039;s network.&lt;br /&gt;
&lt;br /&gt;
--[[User:Mbingham|Mbingham]] 23:56, 1 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Some more links ==&lt;br /&gt;
I haven&#039;t been reading many academic papers on the subject so those links will be very useful.&lt;br /&gt;
&lt;br /&gt;
If I may add to this. I read articles on object storage here:&lt;br /&gt;
&lt;br /&gt;
[http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf Object Storage Overview] &lt;br /&gt;
&lt;br /&gt;
and&lt;br /&gt;
&lt;br /&gt;
[http://www.snia.org/education/tutorials/2010/spring/file/PaulMassiglia_File_Systems_Object_Storage_Devices.pdf File Systems for OSD&#039;s]&lt;br /&gt;
&lt;br /&gt;
I can add that metadata is much richer in an object store context. Searching for files and grouping related files together is much easier with the context information that metadata supplies for objects. I&#039;m beginning to read:&lt;br /&gt;
&lt;br /&gt;
[http://www.seagate.com/docs/pdf/whitepaper/tp_536.pdf The advantages of OSD&#039;s]&lt;br /&gt;
&lt;br /&gt;
--[[User:Myagi|Myagi]] 10:39, 5 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I&#039;m going to write a version of my essay out over the long weekend with headings and references and put it up on the wiki. I&#039;d like to know who and how many people are working on this essay but dunno if that&#039;s possible. We&#039;ll see what we do from there I guess? I was thinking we just homogenize all of the information we write into one unified essay.&lt;br /&gt;
&lt;br /&gt;
--[[User:Myagi|Myagi]] 10:42, 6 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:I think there&#039;s 6 people in our group, though there might only be 5. I&#039;ll be working on this over the long weekend too. I was thinking maybe we should try to get a rough outline up Thursday or Friday. Since Prof Somayaji mentioned that this should have the format of an essay, maybe we could start with what our main argument is?&lt;br /&gt;
&lt;br /&gt;
:I was thinking something like: object stores are becoming more attractive because the demands on filesystems have changed, but the interface has not been updated to accommodate these changes. Then we could go into an explanation of block-based storage, how it fails to meet the needs placed on modern FSs, then how object stores solve these problems. What do you think?&lt;br /&gt;
&lt;br /&gt;
:--[[User:Mbingham|Mbingham]] 01:55, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:You don&#039;t need to write your own independent essay on the wiki. Let&#039;s just add info as it comes along. I&#039;ll be completely without internet access this weekend, but I&#039;ll try to bring some background reading with me. Expect lots of edits from me starting Monday night/Tuesday morning.&lt;br /&gt;
:--[[User:Dagar|Dagar]] 12:59, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:Sounds good! I think that&#039;s a good idea for a thesis statement and we should have a concrete one by Thurs/Fri. Although I&#039;m not absolutely clear about the interface not being updated? I think the object store SCSI standard is constantly being ratified and now they have an OSD-3 draft. [http://www.t10.org/drafts.htm#OSD_Family T10 OSD Working Drafts]. But then again I&#039;m probably misunderstanding something...&lt;br /&gt;
:--[[User:Myagi|Myagi]] 10:08, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
::I didn&#039;t mean that the object interface hadn&#039;t been updated, I meant that the block interface hasn&#039;t been updated to reflect the changing requirements put on storage. Since the block interface is still largely the same as it was decades ago (read/write to blocks) it is unable to handle the new requirements. Object stores look attractive because they are designed to deal with issues like scalability, integrity, security, etc. Sorry for the confusion, I hope it makes more sense now, haha.&lt;br /&gt;
::--[[User:Mbingham|Mbingham]] 15:44, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
:I gotcha, thanks for explaining! I&#039;d say that would be a great thesis statement then: Object stores are becoming more attractive because the demands on filesystems have changed and the block store interface has not been updated to accommodate these changes. We can work from there. I think we can address the inadequacies of block-based storage after stating our thesis, and then for the body we point out how object stores deal with issues of scalability, integrity, security as well as flexibility. And then some kind of nice tie-up reiterating our thesis.&lt;br /&gt;
:--[[User:Myagi|Myagi]] 12:50, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
I might as well put my contribution here. I&#039;m willing to move or change it for the sake of organizing this discussion page.&lt;br /&gt;
&lt;br /&gt;
--[[User:Myagi|Myagi]] 18:15, 7 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
:(moved Myagi&#039;s outline to top of page) --[[User:Mbingham|Mbingham]] 02:31, 8 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
Some links that I found while doing the assignment about object storage and its application to SAN systems:&lt;br /&gt;
http://dsc.sun.com/solaris/articles/osd.html&lt;br /&gt;
http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&lt;br /&gt;
&lt;br /&gt;
--[[User:Npradhan|Npradhan]] 23:45, 9 October 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
== Other ==&lt;br /&gt;
-instead of storing filesystems in terms of blocks, you store in terms of objects.&lt;br /&gt;
&lt;br /&gt;
-extents, named extents&lt;br /&gt;
&lt;br /&gt;
-objects fancier because they can move around.&lt;br /&gt;
&lt;br /&gt;
-extra level of abstraction and indirection&lt;br /&gt;
&lt;br /&gt;
-files made of objects, objects made of blocks&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4347</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4347"/>
		<updated>2010-10-15T02:38:28Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Overview of Block-Based Storage */ Direct Quotes need the authors name and year in parenthesis afterwards&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we are faced with growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has been able to keep up with this increasing demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained unchanged since the 1950s. The dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in storage technology. Ideal improvements would provide better scalability, business intelligence, and manageability while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object Based Storage Devices (OSDs) address these issues through their design. Object storage uses objects that consist of data and metadata describing the object. Objects are accessed with defined methods such as read and write, and each carries a unique ID. The device handles the underlying security, space allocation and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured with unique hash keys, and benefits in management and business intelligence from rich metadata, OSDs can be seen as a viable alternative to improve the standard architectures of SAN and NAS.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. This means that the task of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left completely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot; (Bandulet, 2007)&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of the interface presenting the filesystem with blocks to read and write, the interface presents the filesystem with &amp;quot;objects&amp;quot; which it can read from, write to, create, or destroy. Objects can be variable sized, and the device itself handles mapping them onto physical storage. These objects also have metadata and access controls directly associated with them, allowing the filesystem to work at a higher level of abstraction. This is important because the needs placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
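As a rough sketch of the contrast described above (class and method names are illustrative only, not taken from any OSD standard or the cited papers):

```python
# Hypothetical sketch: a block device exposes fixed-size numbered blocks,
# while an object device exposes variable-sized objects with metadata and
# hides the block mapping inside the device. All names are illustrative.

class BlockDevice:
    """Block-based interface: read/write fixed-size blocks by number."""
    BLOCK_SIZE = 512

    def __init__(self, num_blocks):
        self.blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

    def read_block(self, n):
        return self.blocks[n]

    def write_block(self, n, data):
        if len(data) != self.BLOCK_SIZE:
            raise ValueError("block writes must be exactly one block")
        self.blocks[n] = data


class ObjectStorageDevice:
    """Object-based interface: create/read/write/delete variable-sized
    objects; the device, not the filesystem, maps objects to storage."""

    def __init__(self):
        self.objects = {}   # object ID -> (data, metadata)
        self.next_id = 0

    def create(self, metadata=None):
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = (b"", dict(metadata or {}))
        return oid

    def write(self, oid, data):
        self.objects[oid] = (data, self.objects[oid][1])

    def read(self, oid):
        return self.objects[oid][0]

    def delete(self, oid):
        del self.objects[oid]
```

A filesystem on top of the block device must itself track which blocks belong to which file; on top of the object device it simply hands whole objects, with their metadata, to the device.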
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the standardization of the interface in the 1970s. This means that the functionality of storage devices must also change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. Firstly, the storage architecture must be able to scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive. Personal information, such as financial records, is stored in large databases. Sensitive corporate and governmental information is stored similarly. Since the value of data has increased, it becomes more important to ensure the data&#039;s integrity and security. Block based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design. Object based storage is more suited to address these issues by design.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block based storage systems contain many layers of metadata. There are also various types of virtualized systems that contain metadata to deal with device diversity or the remapping of blocks for archiving or duplication. Building systems to scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system come from its ability to set block access, manage security, prevent unauthorized access to files and use metadata to map blocks into files for the client. However, passing all data through one point causes a bottleneck. Another issue is managing the metadata: metadata is shared among separate metadata servers remote from the hosts, and space allocation management on different storage system layers, along with applications that individually add policy and management metadata, is spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but provide a single system image of the file system. This means that a local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data. This eliminates the bottleneck of NAS. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, it could be assumed that all the clients could be trusted, so security was not a primary concern. This lack of security concern is one of the main reasons that block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management a much more manageable task, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and scalability with metadata. Each object comes with a set of access rules given to it by the management server, and metadata is associated and stored directly with each data object, automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded, reducing server overhead and processing, and allows for larger clusters of storage compared with traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using file systems for archiving and hampers scalability. OSDs, by contrast, have integrity mechanisms that operate differently from those of block store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected. The affected block may not even contain any data. Such errors usually occur during a backup procedure or when a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only worries about managing objects. This is done through managing metadata as well as maintaining internal copies of that metadata. Hence, OSDs have knowledge of their object layout even when one or more groups of objects are on different OSDs. In this way OSDs know what space is used or unused and can scan and correct errors without losing data. In the event of a failure in recovering a file or a number of files, traditional systems may have to do a complete file system restore. However, an OSD&#039;s awareness of its object layout enables it to recover data specific to a byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature. Each object has an associated hash key that is generated uniquely from the contents of the object. The object can thus be verified for accuracy, to ensure the contents remain the same, and for integrity, to ensure the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
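A minimal sketch of how a content-derived hash key could support both the verification and the duplicate-flagging described above (SHA-256 is used here purely as an illustration; the cited sources do not specify an algorithm):

```python
import hashlib

def object_key(data: bytes) -> str:
    """Content hash used as the object's integrity/dedup key (sketch)."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_key: str) -> bool:
    """Integrity check: recompute the hash and compare with the stored key."""
    return object_key(data) == stored_key

def find_duplicates(objects: dict) -> dict:
    """Group object IDs by content hash; any key with more than one ID
    flags duplicate data that could be deduplicated."""
    by_key = {}
    for oid, data in objects.items():
        by_key.setdefault(object_key(data), []).append(oid)
    return {k: v for k, v in by_key.items() if len(v) > 1}
```

Because the key is derived from the content itself, any corruption changes the recomputed hash, and identical contents always collide on the same key.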
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as having four quadrants: external, internal, accidental and malicious. Block based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SAN has traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; For the sake of security, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; these zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but the two differ in the type of device doing the enforcement: switches implement zoning, while disk array controllers perform LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers, hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas. The proprietary operating system a NAS runs has access control configurations, much like other traditional OSs, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients can access an OSD device by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
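The capability check might be sketched as follows, assuming a secret shared between the security manager and the OSD. This is a simplified illustration, not the actual T10 OSD security protocol, and the function names here are hypothetical:

```python
import hmac
import hashlib

def make_capability(secret: bytes, osd_name: str,
                    partition_id: int, object_id: int) -> bytes:
    """Security manager signs the (OSD name, partition ID, object ID)
    tuple, producing a credential the client presents with its request."""
    msg = f"{osd_name}:{partition_id}:{object_id}".encode()
    return hmac.new(secret, msg, hashlib.sha256).digest()

def osd_validate(secret: bytes, capability: bytes, osd_name: str,
                 partition_id: int, object_id: int) -> bool:
    """The OSD recomputes the MAC from its own copy of the secret and
    compares in constant time, rejecting forged or mismatched requests."""
    expected = make_capability(secret, osd_name, partition_id, object_id)
    return hmac.compare_digest(capability, expected)
```

Because the OSD validates each request itself, a misconfigured or compromised host cannot reach objects for which it holds no valid capability.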
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients interact directly with the devices, which avoids the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. But there remain challenges to its adoption in industry. One is that it is currently only needed in high-end business solutions, preventing it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; But as newer features are added and the standards mature, we will see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that changes need to occur as storage grows and finer levels of management are needed for data storage. Object-based storage has evolved to fit these needs where block-based storage has stagnated. With better tools for managing data through the rich metadata of objects, the security and data transfer speeds of NAS and SAN combined, and integrity controls for backups and redundancy, object-based storage will be an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at : &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J.Satran, 2005. The OSD Security Protocol. [online] Available at &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia  [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4345</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4345"/>
		<updated>2010-10-15T02:34:38Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Real World Implementation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we are faced with growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has been able to keep up with this increasing demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained unchanged since the 1950s. The dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in storage technology. Ideal improvements would provide better scalability, business intelligence, and manageability while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object Based Storage Devices (OSDs) address these issues through their design. Object storage uses objects that consist of data and metadata describing the object. Objects are accessed with defined methods such as read and write, and each carries a unique ID. The device handles the underlying security, space allocation and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
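The object model described above can be sketched as a toy in-memory store. This is only an illustration of the interface (the class and method names here are hypothetical); real OSDs implement this abstraction in the device itself:

```python
import uuid

class ObjectStore:
    """Toy in-memory object store: each object carries data plus metadata,
    is addressed by a unique ID, and is accessed only through defined
    methods (create/read/write/delete) -- never by raw block address."""

    def __init__(self):
        self._objects = {}

    def create(self, data: bytes, **metadata) -> str:
        oid = str(uuid.uuid4())  # the device, not the client, assigns the ID
        self._objects[oid] = {"data": data, "meta": dict(metadata)}
        return oid

    def read(self, oid: str) -> bytes:
        return self._objects[oid]["data"]

    def write(self, oid: str, data: bytes) -> None:
        # The device handles space allocation; objects are variable sized.
        self._objects[oid]["data"] = data

    def metadata(self, oid: str) -> dict:
        # Metadata travels with the object rather than living in a
        # separate metadata server.
        return self._objects[oid]["meta"]

store = ObjectStore()
oid = store.create(b"hello", owner="alice")
print(store.read(oid))               # b'hello'
print(store.metadata(oid)["owner"])  # alice
```

The point of the sketch is the contrast with a block interface: the filesystem never sees blocks, only whole objects with attached metadata.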
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and gains in management and business intelligence from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. This means that the job of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research began in the 1990s; see, for example, the work of Gibson et al., &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents it with &amp;quot;objects&amp;quot; which it can read from, write to, create, or destroy. Objects can be variable sized, and the device itself handles mapping them onto physical storage. These objects also have metadata and access controls immediately associated with them, which allows the filesystem to work at a higher level of abstraction. This matters because the demands placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and their interfaces were standardized in the 1970s, so the functionality of storage devices must also change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must be able to scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring its integrity and security has become more important. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various kinds of virtualized systems that carry metadata to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system lie in its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, funneling all data through one point creates a bottleneck. Another issue is managing the metadata: it is shared among separate metadata servers remote from the hosts, while space allocation management on different storage system layers, and applications that individually add policy and management metadata, are spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image of the file system. A local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data; this eliminates the bottleneck of NAS. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern; this is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management much more tractable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and metadata scalability. Each object comes with a set of access rules given to it by the management server, and metadata is stored directly with each data object and automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs ensure data integrity through mechanisms that operate differently from block store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected. The affected block may not even contain any data. Such errors usually arise during a backup procedure or while a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only worries about managing objects. This is done through managing metadata as well as maintaining internal copies of that metadata. Hence, an OSD has knowledge of its object layout even when one or more groups of objects reside on different OSDs. In this way OSDs know what space is used or unused and can scan for and correct errors without losing data. When recovery of a file or a number of files fails, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout, however, enables it to recover data specific to a byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated uniquely from the contents of the file. The file can thus be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
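A minimal sketch of how a per-object content hash supports both integrity verification and duplicate flagging. SHA-256 is an illustrative choice here; the essay&#039;s sources do not name a specific algorithm:

```python
import hashlib

def content_hash(data: bytes) -> str:
    # The hash is derived solely from the object's contents, so identical
    # contents always produce identical keys.
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_hash: str) -> bool:
    # Integrity check: recompute the hash and compare with the stored one.
    return content_hash(data) == stored_hash

def find_duplicates(objects: dict) -> list:
    # Dedup: any two objects sharing a hash are flagged as duplicates.
    seen, dupes = {}, []
    for oid, data in objects.items():
        h = content_hash(data)
        if h in seen:
            dupes.append((seen[h], oid))
        else:
            seen[h] = oid
    return dupes

objs = {"a": b"report-2010", "b": b"report-2010", "c": b"notes"}
h = content_hash(objs["a"])
print(verify(objs["a"], h))    # True  -- contents unchanged
print(verify(b"tampered", h))  # False -- corruption detected
print(find_duplicates(objs))   # [('a', 'b')]
```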
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; From a security standpoint, running a SAN over Fibre Channel helps isolate its network, since it does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; these zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but they differ in the type of device involved: switches implement zoning, while disk array controllers perform LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas, and the proprietary operating system a NAS runs has access control configurations, much like traditional OSs, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
&lt;br /&gt;
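The capability idea can be sketched as follows. This is a deliberately simplified illustration, not the actual T10 OSD security protocol: a security manager signs the (OSD name, partition ID, object ID) tuple with a key shared with the device, and the device verifies the signature before granting access. All names and the key here are hypothetical:

```python
import hashlib
import hmac

DEVICE_KEY = b"shared-secret"  # hypothetical key shared by manager and device

def issue_capability(osd: str, partition: int, obj: int) -> bytes:
    # Security manager: bind the credential to one specific object.
    msg = f"{osd}:{partition}:{obj}".encode()
    return hmac.new(DEVICE_KEY, msg, hashlib.sha256).digest()

def device_check(osd: str, partition: int, obj: int, cap: bytes) -> bool:
    # Storage device: recompute and compare in constant time, so a forged
    # or replayed credential for a different object is rejected.
    expected = issue_capability(osd, partition, obj)
    return hmac.compare_digest(expected, cap)

cap = issue_capability("osd1", 7, 42)
print(device_check("osd1", 7, 42, cap))  # True  -- valid credential
print(device_check("osd1", 7, 99, cap))  # False -- wrong object
```

Because the device itself performs this check, access control no longer depends on trusting every host on the network.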
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients can interact directly with the devices, avoiding the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system, and since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. But challenges to its industry adoption remain. One is that it is currently needed only in high-end business solutions, which keeps it out of reach of smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, we will see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that change is needed as storage grows and finer-grained management of data is required. Object-based storage has evolved to fit these needs where block-based storage has stagnated. With better tools for managing data through rich object metadata, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy, object-based storage will be an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at : &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J.Satran, 2005. The OSD Security Protocol. [online] Available at &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia  [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4344</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4344"/>
		<updated>2010-10-15T02:34:27Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Conclusion */ fixing up references&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; has become a common mantra among storage administrators. The storage industry has kept pace with this demand through matching increases in storage capacity. Unfortunately, the interface between clients and storage devices has remained largely unchanged since the 1950s: the dominant storage mechanism is still block-based technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in the underlying storage technology. Ideally, such improvements would provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object Based Storage Devices (OSDs) address these issues by design. Object storage uses objects that consist of data plus metadata describing that data. Objects are accessed through defined methods such as read and write, and each carries a unique ID. The device itself handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and gains in management and business intelligence from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. This means that the job of abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research began in the 1990s; see, for example, the work of Gibson et al., &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents it with &amp;quot;objects&amp;quot; which it can read from, write to, create, or destroy. Objects can be variable sized, and the device itself handles mapping them onto physical storage. These objects also have metadata and access controls immediately associated with them, which allows the filesystem to work at a higher level of abstraction. This matters because the demands placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and their interfaces were standardized in the 1970s, so the functionality of storage devices must also change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must be able to scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring its integrity and security has become more important. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various kinds of virtualized systems that carry metadata to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system lie in its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, funneling all data through one point creates a bottleneck. Another issue is managing the metadata: it is shared among separate metadata servers remote from the hosts, while space allocation management on different storage system layers, and applications that individually add policy and management metadata, are spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image of the file system. A local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data; this eliminates the bottleneck of NAS. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern; this is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management much more tractable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and metadata scalability. Each object comes with a set of access rules given to it by the management server, and metadata is stored directly with each data object and automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs ensure data integrity through mechanisms that operate differently from block store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected. The affected block may not even contain any data. Such errors usually arise during a backup procedure or while a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only worries about managing objects. This is done through managing metadata as well as maintaining internal copies of that metadata. Hence, an OSD has knowledge of its object layout even when one or more groups of objects reside on different OSDs. In this way OSDs know what space is used or unused and can scan for and correct errors without losing data. When recovery of a file or a number of files fails, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout, however, enables it to recover data specific to a byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated uniquely from the contents of the file. The file can thus be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; From a security standpoint, running a SAN over Fibre Channel helps isolate its network, since it does not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; these zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but they differ in the type of device involved: switches implement zoning, while disk array controllers perform LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks and control disk usage quotas, and the proprietary operating system a NAS device runs has access control configuration, much like a traditional OS, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD device by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients can interact directly with the storage devices, avoiding the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Because objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. Challenges to its adoption in industry remain, however. One is that, at the moment, object storage is only needed in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, we should see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that change is needed as storage grows and finer levels of management are required. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better management tools built on the rich metadata of objects, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy will make object storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage: A Fresh Approach to Long-Term File Storage. [online] Dell. Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle. Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM. Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor. Foundations of Network Storage, Lesson Two: NAS. [online] Available at: &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman. Object Store Based SAN File Systems. [online] IBM Labs. Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at: &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at: &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J. Satran, 2005. The OSD Security Protocol. [online] Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object Storage: The Future Building Block for Storage Systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia. [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4341</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4341"/>
		<updated>2010-10-15T02:33:52Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has kept up with this demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained essentially unchanged since the 1950s: the dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in the underlying storage technology. Ideal improvements would provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object-based storage devices (OSDs) address these issues by design. Object storage uses objects, which consist of data together with metadata that describes the object. Objects are accessed through defined methods such as read and write and carry a unique ID, while the device itself handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and management and business intelligence benefits from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk, which means that abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. The scope of a filesystem therefore extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
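This narrow interface can be sketched in a few lines. The following is a toy Python illustration, not any real driver API: the device exposes nothing but fixed-size block reads and writes, and the file-to-block mapping (the invented `file_table` below) lives entirely in the filesystem layer.

```python
# Toy sketch of the block interface described above (illustrative only,
# not a real driver API): the device exposes nothing but fixed-size blocks.

BLOCK_SIZE = 512

class BlockDevice:
    """A disk that only understands 'read block N' and 'write block N'."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def read_block(self, lba):
        return self.blocks[lba]

    def write_block(self, lba, data):
        assert len(data) == BLOCK_SIZE  # the device accepts whole blocks only
        self.blocks[lba] = data

# The filesystem layer alone knows that "notes.txt" lives in blocks 3 and 4;
# the device has no notion of files at all.
dev = BlockDevice(num_blocks=8)
file_table = {"notes.txt": [3, 4]}

payload = b"hello, block storage".ljust(BLOCK_SIZE, b"\x00")
dev.write_block(file_table["notes.txt"][0], payload)
assert dev.read_block(3).rstrip(b"\x00") == b"hello, block storage"
```

Everything above the device boundary, including padding data out to whole blocks, is bookkeeping the filesystem must do itself, which is exactly the wide scope the paragraph describes.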
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research began in the 1990s; see, for example, Gibson et al., &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself provide a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents it with &amp;quot;objects&amp;quot; which it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto the physical medium. Objects also have metadata and access controls directly associated with them, which lets the filesystem work at a higher level of abstraction. This matters because the demands placed on filesystems have changed; as we compare object-based storage with block-based storage, we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
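By contrast, the object interface can be sketched as follows. This is a hypothetical in-memory Python model whose names are invented for illustration (it is not the T10 OSD command set): objects are variable-sized, addressed by ID rather than block number, and carry their own metadata.

```python
# Hypothetical in-memory model of the object interface described above
# (method names are invented; this is not the T10 OSD command set).

class ObjectStore:
    def __init__(self):
        self._objects = {}
        self._next_id = 1

    def create(self, data=b"", metadata=None):
        """Allocate a new variable-sized object and return its unique ID."""
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = {"data": data, "meta": dict(metadata or {})}
        return oid

    def read(self, oid, offset=0, length=None):
        data = self._objects[oid]["data"]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def write(self, oid, data):
        self._objects[oid]["data"] = data

    def destroy(self, oid):
        del self._objects[oid]

    def get_attr(self, oid, key):
        """Metadata travels with the object, not with a separate server."""
        return self._objects[oid]["meta"].get(key)

# The filesystem now deals in object IDs, not block addresses.
store = ObjectStore()
oid = store.create(b"annual report", metadata={"owner": "alice"})
assert store.read(oid, 0, 6) == b"annual"
assert store.get_attr(oid, "owner") == "alice"
```

Note that nothing here mentions block numbers: how the object is laid out on the physical medium is the device's concern, which is the shift in responsibility the paragraph describes.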
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s, so the functionality of storage devices must change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring its integrity and security has become more important. Block-based storage, as we will see, has difficulty meeting these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata, and various types of virtualized systems add metadata of their own to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage must be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system are its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, forcing all data through one point creates a bottleneck. Another issue is managing the metadata, which is shared among separate metadata servers remote from the hosts: space allocation management sits on different storage system layers, and applications add policy and management metadata individually throughout the system, so the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image. A local user need not be concerned with where data is physically stored, since a level of abstraction separates the user from the data&#039;s physical location; this eliminates the NAS bottleneck. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern; that lack of concern is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, along with the possible adoption of IP-based SAN solutions, makes data security a primary concern.&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt; Object stores can make user privilege management much more manageable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and better scalability in metadata handling. Each object carries a set of access rules given to it by the management server, and metadata is stored directly with each data object and carried automatically between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces support.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs ensure data integrity through mechanisms that operate differently from those of block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that when a block contains an error, it is almost impossible to determine what part of the file system is affected; the faulty block may not even contain any data. Such errors usually arise during a backup procedure or when a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. The file system manager no longer cares what kind of disk drive is being used; it only manages objects. The device accomplishes this by managing metadata and maintaining internal copies of that metadata. An OSD therefore knows its own object layout, even when groups of related objects are spread across different OSDs. It knows which space is in use and which is free, and it can scan for and correct errors without losing data. When a file or a number of files must be recovered, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout lets it recover data for a specific byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key derived from the contents of the file. The hash allows the file to be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. It can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
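The content hash just described can be illustrated in a few lines of Python. SHA-256 is an assumed choice here, since the essay does not name an algorithm; the point is that a single content-derived digest serves both integrity checking and duplicate detection.

```python
import hashlib

# Sketch of the per-object content hash described above. SHA-256 is an
# assumption for illustration; the essay does not specify the algorithm.

def content_key(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

store = {}  # digest -> object contents (stored once per unique content)

def put(data: bytes) -> str:
    key = content_key(data)
    store.setdefault(key, data)  # identical content is stored only once
    return key

def verify(key: str) -> bool:
    """Re-hash the stored contents to detect corruption since ingest."""
    return content_key(store[key]) == key

k1 = put(b"quarterly report")
k2 = put(b"quarterly report")  # duplicate data is flagged by its key
assert k1 == k2 and len(store) == 1
assert verify(k1)
```

Because the key is a pure function of the contents, any corruption changes the recomputed digest and is caught by `verify`, and two identical objects necessarily collide on the same key, which is how deduplication falls out for free.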
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores handle security in a variety of ways, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; From a security standpoint, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel does not communicate over TCP/IP connections. However, because the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures in SAN systems. Zoning allocates a certain amount of storage to each client; zones are isolated and may not communicate outside their respective zone. LUN masking is similar to zoning, but the two differ in the type of device involved: switches implement zoning, while disk array controllers implement LUN masking. A disk array controller manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
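LUN masking as described here amounts to a per-host visibility table kept on the array controller. A toy sketch, with invented host names and LUN assignments:

```python
# Toy model of LUN masking as described above: the disk array controller
# keeps a per-host set of visible LUNs and hides everything else.
# Host names and LUN assignments are invented for the example.

lun_mask = {
    "host-a": {0, 1},
    "host-b": {2},
}

def visible_luns(host):
    """Return the LUNs the controller exposes to a given host."""
    return sorted(lun_mask.get(host, set()))

assert visible_luns("host-a") == [0, 1]
assert visible_luns("host-c") == []  # unknown hosts see no storage
```

Note that the filtering happens at the controller, per host, not per object: a host that can see a LUN can read every block on it, which is the coarse granularity object-level capabilities improve on below.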
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks and control disk usage quotas, and the proprietary operating system a NAS device runs has access control configuration, much like a traditional OS, that can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD device by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
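The capability scheme can be sketched with a keyed MAC. The use of HMAC-SHA256 and the message layout below are illustrative assumptions, not the actual OSD security protocol cited above:

```python
import hashlib
import hmac

# Sketch in the spirit of the OSD security protocol cited above: a security
# manager signs the (OSD name, partition ID, object ID) tuple with a key it
# shares with the device. HMAC-SHA256 and the message layout are assumptions
# made for this illustration.

SECRET = b"key-shared-by-manager-and-osd"  # invented for the example

def issue_capability(osd_name: str, partition_id: int, object_id: int) -> str:
    """Security manager side: sign the tuple identifying one object."""
    msg = f"{osd_name}:{partition_id}:{object_id}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def osd_check(osd_name: str, partition_id: int, object_id: int,
              credential: str) -> bool:
    """Device side: recompute the MAC; no per-user ACL lookup is needed."""
    expected = issue_capability(osd_name, partition_id, object_id)
    return hmac.compare_digest(expected, credential)

cap = issue_capability("osd-7", 2, 9001)
assert osd_check("osd-7", 2, 9001, cap)       # valid credential accepted
assert not osd_check("osd-7", 2, 9002, cap)   # wrong object ID rejected
```

Because the credential is bound to one specific object, a client holding it can touch that object and nothing else, which is how the device enforces access itself rather than trusting the surrounding network.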
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients can interact directly with the storage devices, avoiding the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system. Because objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. Challenges to its adoption in industry remain, however. One is that, at the moment, object storage is only needed in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot11|11]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, we should see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that change is needed as storage grows and finer levels of management are required. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better management tools built on the rich metadata of objects, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy will make object storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage: A Fresh Approach to Long-Term File Storage. [online] Dell. Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle. Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM. Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor. Foundations of Network Storage, Lesson Two: NAS. [online] Available at: &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman. Object Store Based SAN File Systems. [online] IBM Labs. Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at: &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at: &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J. Satran, 2005. The OSD Security Protocol. [online] Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proc. OSDI, 2006. [online] Available at: &amp;lt;http://www.usenix.org/events/osdi06/tech/full_papers/weil/weil_html/&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot11&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;11&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object Storage: The Future Building Block for Storage Systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia. [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4340</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4340"/>
		<updated>2010-10-15T02:32:25Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Real World Implementation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is a common mantra of storage administrators. The storage industry has kept up with this demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained essentially unchanged since the 1950s: the dominant storage mechanism is still block-based storage technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in the underlying storage technology. Ideal improvements would provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object-based storage devices (OSDs) address these issues by design. Object storage uses objects, which consist of data together with metadata that describes the object. Objects are accessed through defined methods such as read and write and carry a unique ID, while the device itself handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and management and business intelligence benefits from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk, which means that abstracting stored data into related groups or into human-understandable constructs such as objects or files is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. The scope of a filesystem therefore extends from high-level constructs like files down to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The Small Computer System Interface (SCSI) standards, which have existed in one form or another since the late 1970s, are popular in industry. Parallel ATA, another standard designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This means the functionality the command set allows has also remained mostly the same, since new functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents it with &amp;quot;objects&amp;quot; which it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto physical storage. Objects also have metadata and access controls immediately associated with them. This allows the filesystem to work at a higher level of abstraction, which matters because the needs placed on filesystems have changed: as the comparison below shows, the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
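The contrast between the two interfaces can be sketched roughly as follows (a hypothetical illustration in Python; the class and method names are invented for this sketch and come from no real standard):&lt;br /&gt;

```python
# Hypothetical sketch contrasting block and object interfaces.
# Names (read_block, create, etc.) are illustrative, not from any spec.

class BlockDevice:
    """Block interface: the OS addresses fixed-length blocks by number."""
    BLOCK_SIZE = 512

    def __init__(self, num_blocks):
        self.blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

    def read_block(self, lba):
        return self.blocks[lba]

    def write_block(self, lba, data):
        assert len(data) == self.BLOCK_SIZE  # caller must supply a whole block
        self.blocks[lba] = data


class ObjectDevice:
    """Object interface: the device maps variable-sized objects to storage itself."""

    def __init__(self):
        self.objects = {}   # maps object_id to (data, metadata)
        self.next_id = 0

    def create(self, data, metadata=None):
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = (bytes(data), dict(metadata or {}))
        return oid

    def read(self, oid, offset=0, length=None):
        data, _ = self.objects[oid]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def delete(self, oid):
        del self.objects[oid]
```

The key difference: with the block interface, the filesystem must track which blocks belong to which file, while with the object interface that mapping, along with per-object metadata, lives inside the device.&lt;br /&gt;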
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s, so the functionality of storage devices must change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring the data&#039;s integrity and security has become more important. Block-based storage, as we will see, has difficulty meeting these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data-access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various types of virtualized systems that contain metadata to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
A NAS system coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system are its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, forcing all data through one point creates a bottleneck. Managing the metadata is another issue: metadata is shared among separate metadata servers remote from the hosts, and space-allocation management on different storage-system layers, together with applications that individually add policy and management metadata, is spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image. A local user need not be concerned with where data is physically stored, since a level of abstraction separates the user from its physical location. This eliminates the NAS bottleneck. In the past, SANs were implemented on private Fibre Channel networks designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern; this lack of security concern is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, along with the possible adoption of IP-based SAN solutions, makes data security a primary concern.&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt; Object stores can make user-privilege management far more manageable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and better scalability of metadata. Each object carries a set of access rules given to it by the management server, and metadata is associated and stored directly with each data object, automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs ensure data integrity through mechanisms that operate differently from block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected; the block containing the error may not even hold any data. Such errors usually occur during a backup procedure or while a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file-system manager what kind of disk drive is being used; it only manages objects. This is achieved by managing metadata and maintaining internal copies of that metadata. An OSD therefore knows its object layout even when groups of objects are spread across different OSDs. In this way, OSDs know which space is used or unused, and can scan for and correct errors without losing data. When recovery of one or more files fails, traditional systems may have to perform a complete file-system restore; an OSD&#039;s awareness of its object layout, however, lets it recover data within a specific byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated from the contents of the file. The file can thus be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
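As a rough sketch of how a content-derived hash key can support both integrity verification and duplicate flagging (SHA-256 is assumed purely for illustration; the essay&#039;s sources do not specify an algorithm):&lt;br /&gt;

```python
import hashlib

def object_hash(data):
    """Derive a key uniquely determined by the object's contents (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()

def verify(data, stored_hash):
    """Integrity check: the contents must still match the recorded hash."""
    return object_hash(data) == stored_hash

def find_duplicates(objects):
    """Flag pairs of objects whose contents hash to the same key."""
    seen = {}
    dupes = []
    for oid, data in objects.items():
        key = object_hash(data)
        if key in seen:
            dupes.append((seen[key], oid))
        else:
            seen[key] = oid
    return dupes
```

Because the key depends only on the bytes of the object, any corruption changes the key, and two objects with identical contents share one, which is exactly what the integrity and deduplication uses described above require.&lt;br /&gt;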
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; For security, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel devices do not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning but differs in the type of device that applies it: switches implement zoning, while disk array controllers implement LUN masking. A disk array controller manages the physical disk drives and presents them as logical unit numbers, hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, a NAS system is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk-usage quotas, and the proprietary operating system a NAS runs has access-control configuration much like other traditional OSs, which can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
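A minimal sketch of the capability idea, assuming an HMAC over the (OSD name, partition ID, object ID) tuple as the cryptographic binding (the names and wire format here are invented for illustration; the real OSD security protocol is considerably more elaborate):&lt;br /&gt;

```python
import hashlib
import hmac

# Illustrative shared secret between the security manager and the OSD.
SECRET = b"key shared by security manager and OSD"

def make_capability(osd_name, partition_id, object_id):
    """Security manager issues a credential bound to exactly one object."""
    msg = ("%s/%d/%d" % (osd_name, partition_id, object_id)).encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (osd_name, partition_id, object_id, tag)

def osd_check(cap, osd_name, partition_id, object_id):
    """The OSD verifies the credential before serving a request."""
    name, part, oid, tag = cap
    if (name, part, oid) != (osd_name, partition_id, object_id):
        return False  # credential is for a different object
    msg = ("%s/%d/%d" % (name, part, oid)).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Because the device can verify the credential itself, clients can talk to it directly without routing every request through a central file server, which is what makes the direct-access architectures described above safe.&lt;br /&gt;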
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&#039;&#039;&#039;Check the talk page. Anil mentioned this implementation to me so I thought it would be a good idea to have something on it in the essay. Do you guys agree? If we keep it we just need to add one more reference from this section&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system built around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; Because Ceph is based on OSDs, clients can interact directly with the devices, avoiding the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system, and since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network-storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. Challenges to industry adoption remain, however. One is that object storage is currently needed only in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; But as new features are added and the standards mature, we will see increased adoption.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that change must come as storage grows and finer-grained management of stored data becomes necessary. Object-based storage has evolved to fit these needs where block-based storage has stagnated. With better tools for managing data through the rich metadata of objects, the security and data-transfer speeds of NAS and SAN combined, and integrity controls for backups and redundancy, object storage will be an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage: A Fresh Approach to Long-Term File Storage. [online] Dell. Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle. Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM. Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor. Foundations of Network Storage, Lesson Two: NAS. [online] Available at: &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman. Object Store Based SAN File Systems. [online] IBM Labs. Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, and R. Moore. Introduction to Storage Area Networks. [online] Available at: &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at: &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, and J. Satran, 2005. The OSD Security Protocol. [online] Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, and J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia. [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4337</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4337"/>
		<updated>2010-10-15T02:27:29Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Real World Implementation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we face growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce; &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is the common mantra of storage administrators. The storage industry has kept up with this demand through matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained unchanged since the 1950s, and the dominant storage mechanism is still block-based technology.&lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network-storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improved storage technology. Ideal improvements would provide better scalability, business intelligence, and management while preserving the security and data-access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object-Based Storage Devices (OSDs) address these issues by design. Object storage uses objects that consist of data and metadata describing the object. Objects are accessed through defined methods such as read and write, and each carries a unique ID. The device itself handles the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and benefits in management and business intelligence from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk. This means that the work of abstracting stored data into related groups, or into human-understandable constructs such as objects or files, is left entirely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The Small Computer System Interface (SCSI) standards, which have existed in one form or another since the late 1970s, are popular in industry. Parallel ATA, another standard designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This means the functionality the command set allows has also remained mostly the same, since new functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents it with &amp;quot;objects&amp;quot; which it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto physical storage. Objects also have metadata and access controls immediately associated with them. This allows the filesystem to work at a higher level of abstraction, which matters because the needs placed on filesystems have changed: as the comparison below shows, the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
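The contrast between the two interfaces can be sketched roughly as follows (a hypothetical illustration in Python; the class and method names are invented for this sketch and come from no real standard):&lt;br /&gt;

```python
# Hypothetical sketch contrasting block and object interfaces.
# Names (read_block, create, etc.) are illustrative, not from any spec.

class BlockDevice:
    """Block interface: the OS addresses fixed-length blocks by number."""
    BLOCK_SIZE = 512

    def __init__(self, num_blocks):
        self.blocks = [bytes(self.BLOCK_SIZE)] * num_blocks

    def read_block(self, lba):
        return self.blocks[lba]

    def write_block(self, lba, data):
        assert len(data) == self.BLOCK_SIZE  # caller must supply a whole block
        self.blocks[lba] = data


class ObjectDevice:
    """Object interface: the device maps variable-sized objects to storage itself."""

    def __init__(self):
        self.objects = {}   # maps object_id to (data, metadata)
        self.next_id = 0

    def create(self, data, metadata=None):
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = (bytes(data), dict(metadata or {}))
        return oid

    def read(self, oid, offset=0, length=None):
        data, _ = self.objects[oid]
        end = len(data) if length is None else offset + length
        return data[offset:end]

    def delete(self, oid):
        del self.objects[oid]
```

The key difference: with the block interface, the filesystem must track which blocks belong to which file, while with the object interface that mapping, along with per-object metadata, lives inside the device.&lt;br /&gt;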
&lt;br /&gt;
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s, so the functionality of storage devices must change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. As the value of data has increased, ensuring the data&#039;s integrity and security has become more important. Block-based storage, as we will see, has difficulty meeting these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while ensuring data-access speed as the system grows is paramount.&lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various types of virtualized systems that contain metadata to deal with device diversity or with remapping blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
A NAS system coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system are its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, forcing all data through one point creates a bottleneck. Managing the metadata is another issue: metadata is shared among separate metadata servers remote from the hosts, and space-allocation management on different storage-system layers, together with applications that individually add policy and management metadata, is spread throughout the system. As a result, the metadata becomes very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image. A local user need not be concerned with where data is physically stored, since a level of abstraction separates the user from its physical location. This eliminates the NAS bottleneck. In the past, SANs were implemented on private Fibre Channel networks designed to emulate local storage media. As long as the network remained exclusive, all clients could be assumed trustworthy, so security was not a primary concern; this lack of security concern is one of the main reasons block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, along with the possible adoption of IP-based SAN solutions, makes data security a primary concern.&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt; Object stores can make user-privilege management far more manageable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and better scalability of metadata. Each object carries a set of access rules given to it by the management server, and metadata is associated and stored directly with each data object, automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger storage clusters than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits scalability. OSDs ensure data integrity through mechanisms that operate differently from block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected; the block containing the error may not even hold any data. Such errors usually occur during a backup procedure or while a controller is reorganizing data.&lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file-system manager what kind of disk drive is being used; it only manages objects. This is achieved by managing metadata and maintaining internal copies of that metadata. An OSD therefore knows its object layout even when groups of objects are spread across different OSDs. In this way, OSDs know which space is used or unused, and can scan for and correct errors without losing data. When recovery of one or more files fails, traditional systems may have to perform a complete file-system restore; an OSD&#039;s awareness of its object layout, however, lets it recover data within a specific byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key generated from the contents of the file. The file can thus be verified for accuracy, ensuring the contents remain the same, and for integrity, ensuring the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
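As a rough sketch of how a content-derived hash key can support both integrity verification and duplicate flagging (SHA-256 is assumed purely for illustration; the essay&#039;s sources do not specify an algorithm):&lt;br /&gt;

```python
import hashlib

def object_hash(data):
    """Derive a key uniquely determined by the object's contents (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()

def verify(data, stored_hash):
    """Integrity check: the contents must still match the recorded hash."""
    return object_hash(data) == stored_hash

def find_duplicates(objects):
    """Flag pairs of objects whose contents hash to the same key."""
    seen = {}
    dupes = []
    for oid, data in objects.items():
        key = object_hash(data)
        if key in seen:
            dupes.append((seen[key], oid))
        else:
            seen[key] = oid
    return dupes
```

Because the key depends only on the bytes of the object, any corruption changes the key, and two objects with identical contents share one, which is exactly what the integrity and deduplication uses described above require.&lt;br /&gt;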
&lt;br /&gt;
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as falling into four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; For security, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel devices do not communicate over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and host systems to handle security.&lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used in SAN systems. Zoning allocates a certain amount of storage to clients; zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning but differs in the type of device that applies it: switches implement zoning, while disk array controllers implement LUN masking. A disk array controller manages the physical disk drives and presents them as logical unit numbers, hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
NAS has its own vulnerabilities, but as with SAN, a NAS system is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk-usage quotas, and the proprietary operating system a NAS runs has access-control configuration much like other traditional OSs, which can prevent unauthorized access to data.&lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent accidental or even malicious access to an OSD, whether external or internal.&lt;br /&gt;
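A minimal sketch of the capability idea, assuming an HMAC over the (OSD name, partition ID, object ID) tuple as the cryptographic binding (the names and wire format here are invented for illustration; the real OSD security protocol is considerably more elaborate):&lt;br /&gt;

```python
import hashlib
import hmac

# Illustrative shared secret between the security manager and the OSD.
SECRET = b"key shared by security manager and OSD"

def make_capability(osd_name, partition_id, object_id):
    """Security manager issues a credential bound to exactly one object."""
    msg = ("%s/%d/%d" % (osd_name, partition_id, object_id)).encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (osd_name, partition_id, object_id, tag)

def osd_check(cap, osd_name, partition_id, object_id):
    """The OSD verifies the credential before serving a request."""
    name, part, oid, tag = cap
    if (name, part, oid) != (osd_name, partition_id, object_id):
        return False  # credential is for a different object
    msg = ("%s/%d/%d" % (name, part, oid)).encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Because the device can verify the credential itself, clients can talk to it directly without routing every request through a central file server, which is what makes the direct-access architectures described above safe.&lt;br /&gt;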
&lt;br /&gt;
== Real World Implementation ==&lt;br /&gt;
&#039;&#039;&#039;Check the talk page. Anil mentioned this implementation to me so I thought it would be a good idea to have something on it in the essay. Do you guys agree? If we keep it we just need to add one more reference from this section&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system based around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions (S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proc. OSDI, 2006). Since Ceph is based on OSDs, it takes advantage of clients&#039; ability to interact directly with the devices, which avoids the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system, and since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. Challenges to its adoption in industry remain, however. One is that, at the moment, it is needed only in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, adoption should increase.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that changes need to occur as storage grows and finer levels of management are needed for data storage. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better tools for managing data through rich object metadata, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy will make object-based storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J.Satran, 2005. The OSD Security Protocol. [online] Available at &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia  [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4336</id>
		<title>COMP 3000 Essay 1 2010 Question 11</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=COMP_3000_Essay_1_2010_Question_11&amp;diff=4336"/>
		<updated>2010-10-15T02:25:19Z</updated>

		<summary type="html">&lt;p&gt;Mbingham: /* Comparison of object and block based stores */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Question=&lt;br /&gt;
&lt;br /&gt;
Why are object stores an increasingly attractive building block for filesystems (as opposed to block-based stores)? Explain.&lt;br /&gt;
&lt;br /&gt;
=Answer=&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Each year we are faced with growing storage needs as the world&#039;s information increases exponentially. Businesses are increasingly choosing to archive and retain all the data they produce, and &amp;quot;store everything, forever&amp;quot;&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; is the common mantra of storage administrators. The storage industry has been able to keep up with the increasing demand with matching increases in storage capacity. Unfortunately, the interfaces between clients and storage devices have remained unchanged since the 1950s: the dominant storage mechanism is still block-based storage technology. &lt;br /&gt;
&lt;br /&gt;
Innovation in storage technology is especially pertinent to businesses that use network storage. The two dominant network storage technologies, the storage area network (SAN) and network-attached storage (NAS), each have their own benefits and drawbacks, and both would benefit greatly from improvements in storage technology. Ideal improvements would provide better scalability, business intelligence, and management while preserving the security and data access speed of traditional storage solutions.&lt;br /&gt;
&lt;br /&gt;
Object-Based Storage Devices (OSDs) address these issues by design. Object storage uses objects, which consist of data plus metadata describing the object. Objects are accessed through defined methods such as read and write, and each carries a unique ID. The devices themselves handle the underlying security, space allocation, and basic storage routines.&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt; This storage technology has the potential to address some of the problems with block-based storage.&lt;br /&gt;
&lt;br /&gt;
With increased scalability, better security through per-object access control, data integrity ensured by unique hash keys, and benefits in management and business intelligence from rich metadata, OSDs are a viable alternative for improving the standard SAN and NAS architectures.&lt;br /&gt;
&lt;br /&gt;
== Overview of Block-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Hard disks as a storage medium date back to the 1950s with the introduction of the IBM 350 disk storage unit.&amp;lt;sup&amp;gt;[[#Foot3|3]]&amp;lt;/sup&amp;gt; Hard disks store data in blocks, which are fixed-length sequences of bytes. Since early devices like the IBM 350, the interface that the operating system uses to communicate with the hard disk has remained mostly the same.&amp;lt;sup&amp;gt;[[#Foot4|4]]&amp;lt;/sup&amp;gt; This interface simply allows the operating system to read or write blocks on the disk, which means that abstracting stored data into related groups or into human-understandable constructs such as objects or files is left completely to the operating system&#039;s filesystem. For example, when the filesystem wants to write data to a file, it must translate that into a block on the disk to write to. In this way, the scope of a filesystem extends from high-level constructs like files to low-level constructs like blocks. This wide scope is necessary because the simple interface presented to the filesystem must be abstracted up to the complex expectations of a user.&lt;br /&gt;
&lt;br /&gt;
Multiple standards exist to implement this interface. The small computer system interface (SCSI) standards, which have been around in one form or another since the late 1970s, are popular with industry. Parallel ATA, another standard which was designed in the 1980s, continues today in the form of Serial ATA (SATA). However, even though these standards have been around for a long time, &amp;quot;the logical interface, or the command set, has seen only minor additions&amp;quot;&amp;lt;sup&amp;gt;[[#Foot2|2]]&amp;lt;/sup&amp;gt;. This means that the functionality that the command set allows has also remained mostly the same, since the functionality must be built on top of these dated commands.&lt;br /&gt;
&lt;br /&gt;
== Overview of Object-Based Storage ==&lt;br /&gt;
&lt;br /&gt;
Unlike block-based storage, object-based storage research started in the 1990s; see, for example, the work of Gibson et al. in &amp;quot;A Cost-Effective, High-Bandwidth Storage Architecture&amp;quot;, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998. The fundamental idea of an object-based storage device is to have the storage device itself handle a layer of abstraction on top of the block. Instead of presenting the filesystem with blocks to read and write, the interface presents the filesystem with &amp;quot;objects&amp;quot; that it can read, write, create, or destroy. Objects can be variable-sized, and the device itself handles mapping them onto physical storage. These objects also have metadata and access controls immediately associated with them, which allows the filesystem to work at a higher level of abstraction. This is important because the needs placed on filesystems have changed, and as we compare object-based storage with block-based storage we will see that the design of objects is better suited than blocks to the needs of today&#039;s filesystems, especially networked filesystems.&lt;br /&gt;
&lt;br /&gt;
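The object interface described above (create, read, write, destroy, with per-object metadata and device-managed space allocation) can be sketched with a minimal in-memory model. This is a hypothetical illustration, not the T10 OSD command set; all class and method names here are invented for the sketch.

```python
import itertools

class ObjectStore:
    """Toy in-memory model of an object-based storage device.

    Unlike a block device, the store hands out opaque object IDs and
    manages space allocation itself; callers never see blocks.
    """
    def __init__(self):
        self._objects = {}              # object ID -> bytearray of contents
        self._metadata = {}             # object ID -> metadata dict
        self._ids = itertools.count(1)  # unique, device-assigned object IDs

    def create(self, **metadata):
        oid = next(self._ids)
        self._objects[oid] = bytearray()
        self._metadata[oid] = dict(metadata)  # metadata travels with the object
        return oid

    def write(self, oid, offset, data):
        buf = self._objects[oid]
        if len(buf) < offset + len(data):     # objects are variable-sized:
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))  # grow on demand
        buf[offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def destroy(self, oid):
        del self._objects[oid], self._metadata[oid]

store = ObjectStore()
oid = store.create(owner="alice", type="report")
store.write(oid, 0, b"hello object storage")
print(store.read(oid, 6, 6))  # b'object'
```

The point of the sketch is that the filesystem above this interface never translates files into block addresses; it only names objects.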
== Changing Storage Needs ==&lt;br /&gt;
&lt;br /&gt;
Storage needs have changed significantly since the first hard disks were developed in the 1950s and the interface was standardized in the 1970s, so the functionality of storage devices must also change to reflect these needs. Storage has become increasingly networked, and networked storage must deal with several issues. First, the storage architecture must be able to scale to terabytes of data and beyond, with many servers and clients, while avoiding bottlenecks. The data stored on these networks has also become more sensitive: personal information, such as financial records, is stored in large databases, and sensitive corporate and governmental information is stored similarly. Since the value of data has increased, it becomes more important to ensure the data&#039;s integrity and security. Block-based storage, as we will see, has difficulty dealing with these priorities because of limitations inherent in its design; object-based storage is better suited to address them.&lt;br /&gt;
&lt;br /&gt;
== Comparison of object and block based stores ==&lt;br /&gt;
=== Scalability ===&lt;br /&gt;
Scalability is very important for large businesses that need to manage large data centers. Managing metadata while maintaining data access speed as the system grows is paramount. &lt;br /&gt;
&lt;br /&gt;
Most block-based storage systems contain many layers of metadata. There are also various types of virtualized systems that contain metadata to deal with device diversity or the remapping of blocks for archiving or duplication. Building systems that scale with this metadata becomes a major issue, while at the same time the current speeds of block-based storage need to be maintained.&lt;br /&gt;
&lt;br /&gt;
NAS is a file system that coordinates the interface between file blocks and clients&#039; access to files. This is done through a single NAS head, which usually has thousands of gigabytes of storage behind it.&amp;lt;sup&amp;gt;[[#Foot5|5]]&amp;lt;/sup&amp;gt; All data traffic must flow through this single access point. The benefits of the NAS file system come from its ability to set block access, manage security, prevent unauthorized access to files, and use metadata to map blocks into files for the client. However, routing all data through one point creates a bottleneck. Another issue is managing the metadata: it is shared among separate metadata servers remote from the hosts, space allocation management sits on different storage system layers, and applications add policy and management metadata individually throughout the system, all of which makes the metadata very hard to manage.&lt;br /&gt;
&lt;br /&gt;
SANs, on the other hand, offer file systems that are distributed but present a single system image. A local user need not be concerned with where the data is physically stored, since a level of abstraction separates the user from the physical location of the data; this eliminates the bottleneck of NAS. In the past, SANs were implemented on private Fibre Channel networks, which were designed to emulate local storage media. As long as the network remained exclusive, it could be assumed that all the clients could be trusted, so security was not a primary concern. This lack of security concern is one of the main reasons that block storage was a viable option for SANs of the past. Modern SANs can serve a much larger set of users, not all of whom can or should be trusted. This, in addition to the possible adoption of IP-based SAN solutions, makes data security a primary concern&amp;lt;sup&amp;gt;[[#Foot6|6]]&amp;lt;/sup&amp;gt;. Object stores can make user privilege management much more tractable, since each object can &#039;know&#039; who is allowed to access it.&lt;br /&gt;
&lt;br /&gt;
Object storage provides the ability to operate a SAN setup with direct access to data while offering better security and more scalable metadata. Each object comes with a set of access rules given to it by the management server, and metadata is stored directly with each data object and automatically carried between layers and across devices. Space allocation and management metadata are the responsibility of the storage device.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt; This allows metadata layers to be folded together, reducing server overhead and processing, and allows for larger clusters of storage than traditional block-based interfaces.&lt;br /&gt;
&lt;br /&gt;
=== Integrity ===&lt;br /&gt;
Block-based file systems in archive solutions usually have no built-in mechanisms for assuring data integrity. A common best practice is to conduct frequent backups, which adds to the complexity of using such file systems for archiving and limits their scalability. OSDs have integrity mechanisms that operate differently from those of block-store systems.&lt;br /&gt;
&lt;br /&gt;
One of the major problems with storage at the block level is that if there is an error in a block, it is almost impossible to determine what part of the file system is affected; the erroneous block may not even contain any data. Such errors usually happen during a backup procedure or when a controller is reorganizing data. &lt;br /&gt;
&lt;br /&gt;
OSDs provide a level of abstraction that hides the fact that a disk device has blocks. It no longer matters to the file system manager what kind of disk drive is being used; it only manages objects. This is done by managing metadata and maintaining internal copies of that metadata. Hence, an OSD has knowledge of its object layout even when one or more groups of objects are on different OSDs. In this way OSDs know which space is used or unused and can scan for and correct errors without losing data. In the event of a failure when recovering one or more files, traditional systems may have to do a complete file system restore; an OSD&#039;s awareness of its object layout, however, enables it to recover data within a specific byte range and thus restore files efficiently.&lt;br /&gt;
&lt;br /&gt;
OSDs have another powerful feature: each object has an associated hash key derived uniquely from the contents of the object. The object can thus be verified for accuracy, to ensure the contents remain the same, and for integrity, to ensure the data has not been corrupted. The hash can also be used in data management to flag duplicate data.&amp;lt;sup&amp;gt;[[#Foot1|1]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
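The per-object hash key described above can be illustrated with a standard cryptographic digest. This is a sketch using SHA-256, not the specific algorithm of any OSD product; the function names are invented for the example.

```python
import hashlib

def object_hash(data: bytes) -> str:
    """Content-derived key: identical contents always hash to the same key."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_hash: str) -> bool:
    """Integrity check: recompute the hash and compare with the stored key."""
    return object_hash(data) == stored_hash

# Store an object together with its content-derived hash key.
payload = b"quarterly report, final version"
key = object_hash(payload)

# Later, confirm the contents have not been corrupted...
assert verify(payload, key)
# ...and detect even a single altered byte.
assert not verify(payload[:-1] + b"X", key)

# Duplicate data is flagged simply because it hashes to the same key.
assert object_hash(b"copy of data") == object_hash(b"copy of data")
```

This is why one key serves all three purposes the text mentions: verification, corruption detection, and duplicate flagging all reduce to comparing content-derived hashes.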
=== Security ===&lt;br /&gt;
&lt;br /&gt;
Security threats can be thought of as occupying four quadrants: external, internal, accidental, and malicious. Block-based stores have a variety of ways of handling security, but there are basic concepts that SAN and NAS technologies use to secure data.&lt;br /&gt;
&lt;br /&gt;
SANs have traditionally run on Fibre Channel networks.&amp;lt;sup&amp;gt;[[#Foot7|7]]&amp;lt;/sup&amp;gt; For the sake of security, running a SAN on Fibre Channel helps isolate its network, since Fibre Channel traffic does not travel over TCP/IP connections. However, since the SAN devices themselves do not restrict access, it is up to the network infrastructure and the host systems to handle security. &lt;br /&gt;
&lt;br /&gt;
Zoning and LUN masking are typical security measures used by SAN systems. Zoning allocates a certain amount of storage to clients; these zones are isolated and are not allowed to communicate outside their respective zone. LUN masking is similar to zoning, but the two differ in the type of device that enforces them: switches implement zoning, while disk array controllers perform LUN masking. A disk array controller is a device that manages the physical disk drives and presents them as logical unit numbers (LUNs), hence the term LUN masking.&amp;lt;sup&amp;gt;[[#Foot8|8]]&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
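LUN masking as described above amounts to a visibility filter kept by the array controller. A minimal sketch, assuming a simple table mapping each host to the LUNs it may see; all WWN and LUN identifiers here are made up for illustration.

```python
# Toy model of LUN masking on a disk array controller: the controller keeps
# a masking table from host WWN (world wide name) to the set of LUNs that
# host is permitted to see; hosts not in the table see nothing.
masking_table = {
    "wwn:10:00:00:aa": {0, 1},  # hypothetical database server: LUNs 0 and 1
    "wwn:10:00:00:bb": {2},     # hypothetical backup server: LUN 2 only
}

def visible_luns(host_wwn, all_luns):
    """Return the LUNs the controller exposes to this host (empty if unknown)."""
    return sorted(masking_table.get(host_wwn, set()) & set(all_luns))

all_luns = [0, 1, 2, 3]
print(visible_luns("wwn:10:00:00:aa", all_luns))  # [0, 1]
print(visible_luns("wwn:10:00:00:cc", all_luns))  # [] -- unknown hosts see nothing
```

Zoning works analogously on the switch, filtering which ports may talk to each other rather than which LUNs are exposed.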
NAS has its own vulnerabilities, but as with SAN, it is only as secure as the network it operates on. NAS security is conceptually simpler than SAN security. NAS environments can administer security tasks as well as control disk usage quotas. The proprietary operating system a NAS device runs has access control configurations, much like traditional operating systems, that can prevent unauthorized access to data. &lt;br /&gt;
&lt;br /&gt;
Unlike NAS and SAN systems, OSD devices handle security requests directly. The set of protocols used by OSDs enables them to cover the four quadrants of security threats outlined above. Clients access an OSD device by providing &amp;quot;cryptographically secure credentials&amp;quot;, called capabilities, which specify a tuple (OSD name, partition ID, object ID) to identify the object.&amp;lt;sup&amp;gt;[[#Foot9|9]]&amp;lt;/sup&amp;gt; This can prevent access to an OSD that is accidental or malicious, whether external or internal.&lt;br /&gt;
&lt;br /&gt;
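The capabilities described above can be sketched as MAC-protected credentials. This follows the general idea of the OSD security protocol, in which a security manager and the device share a secret, but the field layout and function names below are simplified assumptions, not the T10 wire format.

```python
import hashlib
import hmac

# Assumption for the sketch: one secret shared by the security manager and the OSD.
DEVICE_KEY = b"secret shared by security manager and OSD"

def issue_capability(osd_name: str, partition_id: int, object_id: int, rights: str):
    """Security manager signs the (OSD name, partition ID, object ID) tuple
    together with the granted rights."""
    msg = f"{osd_name}/{partition_id}/{object_id}/{rights}".encode()
    tag = hmac.new(DEVICE_KEY, msg, hashlib.sha256).hexdigest()
    return (osd_name, partition_id, object_id, rights, tag)

def osd_check(cap) -> bool:
    """The OSD recomputes the MAC itself; a forged or altered capability fails."""
    osd_name, partition_id, object_id, rights, tag = cap
    msg = f"{osd_name}/{partition_id}/{object_id}/{rights}".encode()
    expected = hmac.new(DEVICE_KEY, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

cap = issue_capability("osd-1", 7, 42, "read")
assert osd_check(cap)                        # genuine capability accepted
forged = cap[:3] + ("write",) + cap[4:]      # client tries to upgrade its rights...
assert not osd_check(forged)                 # ...and the OSD rejects the forgery
```

Because the device itself verifies each credential, it can safely serve clients directly without trusting the surrounding network, which is the property the text contrasts with SAN and NAS.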
== Real World Implementation ==&lt;br /&gt;
&lt;br /&gt;
Ceph is an example of a real-world networked storage system based around OSDs. The Ceph developers specifically list performance, reliability, and scalability as the benefits their system offers over current solutions. Since Ceph is based on OSDs, it takes advantage of clients&#039; ability to interact directly with the devices, which avoids the traditional performance bottlenecks caused by SAN controllers or NAS heads. This direct access allows Ceph to support a very large number of clients concurrently accessing data on the system, and since objects carry their own security controls, Ceph can allow this direct access safely, unlike other network storage architectures.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Although object storage is relatively new compared to block storage, work has progressed steadily in universities and on standards such as the ANSI T10 SCSI OSD standard. Challenges to its adoption in industry remain, however. One is that, at the moment, it is needed only in high-end business solutions, which keeps it from reaching smaller businesses.&amp;lt;sup&amp;gt;[[#Foot10|10]]&amp;lt;/sup&amp;gt; As newer features are added and the standards mature, adoption should increase.&lt;br /&gt;
&lt;br /&gt;
It is clear, however, that changes need to occur as storage grows and finer levels of management are needed for data storage. Object-based storage has evolved to fit these needs where block-based storage has stagnated. Better tools for managing data through rich object metadata, the combined security and data transfer speeds of NAS and SAN, and integrity controls for backups and redundancy will make object-based storage an attractive choice for storage administrators in the future.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot1&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt; Dell Product Group, 2010. Object Storage A Fresh Approach to Long-Term File Storage. [online] Dell Available at: &amp;lt;http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot2&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; C. Bandulet, 2007. Object-Based Storage Devices. [online] Oracle Available at: &amp;lt;http://developers.sun.com/solaris/articles/osd.html&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot3&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt; IBM 350 disk storage unit, IBM Archives. [online] IBM Available at: &amp;lt;http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot4&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt; M. Mesnier, G. R. Ganger, and E. Riedel. Object-Based Storage. IEEE Communications Magazine, 41(8), August 2003.&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot5&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; TechRepublic Guest Contributor, Foundations of Network Storage, Lesson Two: NAS. [online] Available at &amp;lt;http://articles.techrepublic.com.com/5100-22_11-5841266.html&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot6&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt; Satran and Teperman, Object Store Based SAN File Systems. [online] IBM Labs Available at: &amp;lt;http://www.research.ibm.com/haifa/projects/storage/zFS/papers/amalfi.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot7&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt; J. Tate, F. Lucchese, R. Moore. Introduction to Storage Area Networks. [online] Available at &amp;lt;http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot8&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;8&amp;lt;/sup&amp;gt; H. Yoshida. LUN Security Considerations for Storage Area Networks. [online] Available at &amp;lt;http://www.it.hds.com/pdf/wp91_san_lun_secur.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot9&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt; M. Factor, D. Nagle, D. Naor, E. Riedel, J.Satran, 2005. The OSD Security Protocol. [online] Available at &amp;lt;http://www.research.ibm.com/haifa/projects/storage/objectstore/papers/OSDSecurityProtocol.pdf&amp;gt; [Accessed 14 October 2010].&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;Foot10&amp;quot;&amp;gt;&amp;lt;sup&amp;gt;10&amp;lt;/sup&amp;gt; M. Factor, K. Meth, D. Naor, O. Rodeh, J. Satran, 2005. Object storage: The future building block for storage systems. In 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia  [online] Available at: &amp;lt;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.3959&amp;amp;rep=rep1&amp;amp;type=pdf&amp;gt; [Accessed 13 October 2010].&amp;lt;/span&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mbingham</name></author>
	</entry>
</feed>