Soma-notes - User contributions [en]

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:19:18Z

Naseido: /* Problem*/

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while keeping them isolated from one another to some extent through resource management. The hypervisor uses three controls: reservations, which set a minimum bound on the allocation (in absolute units); limits, which set a maximum bound on the allocation (again in absolute units); and shares, which allocate resources in proportion to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the outstanding issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. In addition, the available throughput can change over time, so allocations must be adjusted dynamically.
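As a rough illustration of how the three controls interact, the sketch below splits a fixed capacity among VMs so that each allocation is proportional to its shares but clamped between its reservation and its limit. This is only one plausible reading of the controls, not the hypervisor's actual implementation: the function name <code>allocate</code>, the bisection approach, and the example numbers are all assumptions made for illustration.

```python
def allocate(capacity, vms, iterations=60):
    """Split capacity among VMs given (reservation, limit, shares) per VM.

    Illustrative sketch: assumes every shares value is positive and
    every limit is finite.
    """
    def clamp(level, controls):
        reservation, limit, shares = controls
        # Proportional allocation, bounded below and above.
        return min(max(level * shares, reservation), limit)

    def total(level):
        return sum(clamp(level, c) for c in vms.values())

    # Bisect on the per-share "level" until the clamped total meets capacity.
    low, high = 0.0, max(limit / shares for _, limit, shares in vms.values())
    for _ in range(iterations):
        mid = (low + high) / 2
        if total(mid) < capacity:
            low = mid
        else:
            high = mid
    return {name: clamp(high, c) for name, c in vms.items()}


# Hypothetical VMs: name -> (reservation, limit, shares), in IOPS.
vms = {"vm1": (250, 1000, 1), "vm2": (0, 300, 2), "vm3": (0, 1000, 1)}
allocation = allocate(1000, vms)
print({name: round(value) for name, value in allocation.items()})
```

In this example vm2 has the most shares but is capped by its limit of 300, so the remainder is divided between vm1 and vm3. If the reservations together exceed the capacity, the sketch simply returns the reservations, which over-commits the device.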

'''Throughput''' is the average rate of job completion. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, the term is used to discuss the fact that throughput varies, and that the number of jobs a system wishes to complete varies as well. Therefore, it is necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate I/O resources. It is a proportional-share algorithm that divides the total throughput among the VMs in proportion to their assigned shares. Its shortcoming is that it does not consider reservations or limits in its allocation.
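The effect of share-only allocation can be seen with a small calculation. The sketch below is not SFQ itself (it ignores queuing and tags entirely); it only computes the steady-state split that a proportional-share scheduler aims for. The function name and the numbers are hypothetical.

```python
def proportional_split(capacity, shares):
    """Divide capacity among VMs in proportion to their shares."""
    total_shares = sum(shares.values())
    return {vm: capacity * s / total_shares for vm, s in shares.items()}


# Hypothetical device delivering 1200 IOPS.
before = proportional_split(1200, {"vm1": 1, "vm2": 1, "vm3": 2})
after = proportional_split(1200, {"vm1": 1, "vm2": 1, "vm3": 2, "vm4": 2})
print(before)
print(after)
```

With no reservation to protect it, vm1 falls from 300 to 200 IOPS as soon as vm4 is added, even though nothing about vm1 itself has changed.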

==Research problem==

====Problem====
Today, I/O resource allocation in modern hypervisors is very simple and somewhat primitive. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm on each host uses a fair scheduler called SFQ(D), a variant of Start-time Fair Queuing [[#Foot2|2]] [[#Foot3|3]]. SFQ(D) performs fairly well for low-intensity workloads. However, as the number of VMs and their workloads grow, so does the need for better performance and efficiency.

PARDA allocates I/O resources to VMs proportional to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss (a 40% performance drop). This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
Therefore, hypervisors require a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently. The mClock algorithm is the solution that Gulati, Varman, and Merchant have proposed in this paper.

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new, more efficient algorithm for the allocation of I/O resources. Older methods such as SFQ were limited in that they provided only proportional shares. The proposed algorithm, mClock, incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock is able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is cheap to compute. Adding VMs does not push any VM's performance below its guaranteed level, because mClock reservations continue to be met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock attempts to achieve is a combination of a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler ensures that each VM's minimum I/O reservation is consistently met without exceeding its upper limit. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use multiple tags, based on the three controls, and to decide dynamically which tag to use for scheduling, while still synchronizing idle VMs [[#Foot2|2]].
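The idea of multiple tags per request can be sketched as follows. This is a heavily simplified illustration, not the algorithm from the paper: it assumes each request receives a reservation tag, a limit tag, and a shares tag spaced 1/rate apart, and it omits the tag adjustments and idle-VM synchronization that the full algorithm performs. The names <code>VM</code>, <code>next_tag</code>, and <code>dispatch</code> are hypothetical.

```python
import math
from collections import deque


def next_tag(previous, rate, now):
    """Space consecutive tags 1/rate apart, but never behind the current time."""
    if rate <= 0:
        return math.inf          # no reservation: never due in the constraint phase
    if previous is None:
        return now
    return max(previous + 1.0 / rate, now)


class VM:
    def __init__(self, name, reservation, limit, weight):
        self.name = name
        self.rates = {"R": reservation, "L": limit, "P": weight}
        self.last = {"R": None, "L": None, "P": None}
        self.queue = deque()     # pending requests, each a dict of tags

    def submit(self, now):
        tags = {k: next_tag(self.last[k], self.rates[k], now) for k in "RLP"}
        self.last = tags
        self.queue.append(tags)


def dispatch(vms, now):
    """Return (vm name, phase) for the next request to issue, or None."""
    waiting = [vm for vm in vms if vm.queue]
    # Constraint-based phase: serve any VM whose reservation tag is due.
    due = [vm for vm in waiting if vm.queue[0]["R"] <= now]
    if due:
        vm, phase = min(due, key=lambda v: v.queue[0]["R"]), "reservation"
    else:
        # Weight-based phase: among VMs under their limit, smallest shares tag wins.
        under_limit = [vm for vm in waiting if vm.queue[0]["L"] <= now]
        if not under_limit:
            return None
        vm, phase = min(under_limit, key=lambda v: v.queue[0]["P"]), "shares"
    vm.queue.popleft()
    return vm.name, phase


a = VM("a", reservation=2, limit=math.inf, weight=1)
b = VM("b", reservation=0, limit=math.inf, weight=3)
for vm in (a, b):
    for _ in range(3):
        vm.submit(now=0.0)
order = [dispatch([a, b], now=0.0) for _ in range(6)]
print(order)
```

In this run, VM a's first request is served in the reservation phase because its reservation tag is already due; after that, VM b's larger weight gives it smaller shares tags, so its requests go ahead of a's remaining ones.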

mClock was implemented on a modified version of the VMware ESX server hypervisor [[#Foot4|4]] [[#Foot5|5]]. This modification took only around 200 lines of C code, which shows that it was not very difficult to implement mClock effectively in an existing product and to improve its performance. This suggests that mClock would be portable and easy to implement in other hypervisors as well.

Another contribution is the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock targets cluster-based storage systems, which are emerging as a cost-effective alternative to centralized disk arrays. The reservation mechanism in this modified algorithm gives higher preference to non-idle VMs in order to attain high performance. dmClock proved to be effective using only a simple modification of the mClock algorithm, which does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of this algorithm is that it is able to meet those controls even when capacity varies. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to be comfortable working with multiple VMs on one host without constantly worrying about performance levels each time a VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the writing style used for displaying calculations. In many sections of the paper, the calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations are often about how the algorithm would calculate the resource allocation, but it was not really necessary to include them; the paper is understandable without them.
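Set out on separate lines, the arithmetic in the quoted sentence is easier to follow. The short check below assumes that the 300 in the quote is the product of Tm and Bpeak expressed in KB, taking 1 MB as 1000 KB; the variable names are ours, not the paper's.

```python
t_m_ms = 5                 # mechanical delay Tm, in milliseconds
b_peak_kb_per_ms = 60      # 60 MB/s is 60 KB/ms, taking 1 MB = 1000 KB
io_size_kb = 8             # reference I/O size

denominator_kb = t_m_ms * b_peak_kb_per_ms    # Tm * Bpeak = 300 KB
factor = 1 + io_size_kb / denominator_kb      # about 1.027, i.e. close to 1
print(f"Tm * Bpeak = {denominator_kb} KB, factor = {factor:.4f}")
```

Since the factor differs from 1 by less than 3%, the quoted approximation of the numerator by Lat1 is reasonable for small I/O sizes.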

In general, however, the paper is clear and not difficult to understand. In addition, the case for this algorithm appears well-presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:14:59Z

Naseido: /* Critique */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. Three controls used by the hypervisor are: reservations, where the minimum bounds are set (in absolute units); limits, where the maximum upper bound on the allocation is set (again in absolute units); and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the average rate of job completion. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
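As a rough illustration of the proportional-sharing idea described above (not SFQ itself, which orders individual requests by start tags), the following Python sketch divides a fixed throughput among VMs by share weight. The function name, the VM names and the 1000 IOPS capacity are assumptions made for the example.

```python
def proportional_shares(capacity_iops, shares):
    """Split a fixed capacity among VMs in proportion to their share weights.

    `shares` maps a (hypothetical) VM name to its weight. Reservations and
    limits are deliberately absent: that is the gap the essay describes.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one VM must have a positive share")
    return {vm: capacity_iops * weight / total for vm, weight in shares.items()}


three_vms = proportional_shares(1000, {"vm1": 1, "vm2": 1, "vm3": 2})
four_vms = proportional_shares(1000, {"vm1": 1, "vm2": 1, "vm3": 2, "vm4": 1})
```

With shares 1:1:2 the VMs receive 250, 250 and 500 IOPS. Adding a fourth VM with one share cuts vm1 to 200 IOPS, since nothing in this scheme enforces a reservation or a limit.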

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency. Hypervisors require a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, I/O resource allocation in modern hypervisors is very simple and somewhat primitive. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. SFQ(D) (Start-time Fair Queuing) performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency.

PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which uses all three controls in order to properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new, more efficient algorithm for the allocation of I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. The proposed algorithm, mClock, incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device increased, or another VM was added, the performance of all hosts would drop significantly, which made I/O management in hypervisors unreliable. In contrast, mClock is able to give VMs a guaranteed minimum reservation of I/O resources, so application performance should not drop below a set point. This provides much better application stability on each VM, and better overall performance and efficiency, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without the others knowing that “their” resources are being given to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.
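The combined effect of the three controls can be pictured with a small Python sketch. It computes the steady-state allocation the two schedulers aim for: each VM receives throughput in proportion to its shares, but never less than its reservation and never more than its limit. This is an illustration only, not the scheduler from the paper; the function name `allocate`, the VM names and the numbers are assumptions, and every VM is assumed to have positive shares.

```python
def allocate(capacity, vms, iterations=60):
    """Return name -> allocation for `vms`, a dict of name -> (reservation, limit, shares).

    Each VM gets clamp(shares * x, reservation, limit); a binary search finds
    the scaling factor x at which the allocations add up to the capacity.
    """
    if sum(r for r, _, _ in vms.values()) > capacity:
        raise ValueError("reservations exceed available capacity")

    def alloc_at(x):
        return {name: min(max(w * x, r), l) for name, (r, l, w) in vms.items()}

    # The limits may add up to less than the capacity; never hand out more.
    target = min(capacity, sum(l for _, l, _ in vms.values()))

    lo, hi = 0.0, 1.0
    while sum(alloc_at(hi).values()) < target:  # grow the upper bound
        hi *= 2
    for _ in range(iterations):  # shrink [lo, hi] around the solution
        mid = (lo + hi) / 2
        if sum(alloc_at(mid).values()) < target:
            lo = mid
        else:
            hi = mid
    return alloc_at(hi)


INF = float("inf")
vms = {
    "A": (250, INF, 1),  # reservation 250 IOPS, no limit, 1 share
    "B": (0, INF, 1),    # no reservation, no limit, 1 share
    "C": (0, 200, 2),    # limit 200 IOPS, 2 shares
}
plenty = allocate(1000, vms)  # roughly A=400, B=400, C=200
scarce = allocate(400, vms)   # roughly A=250, B=50,  C=100
```

When capacity is plentiful, C is capped at its limit and the remainder is split by shares. When capacity drops to 400 IOPS, A is held at its reservation while B and C share what is left in a 1:2 ratio.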

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use multiple tags based on the three controls and to decide dynamically which tag to use for scheduling, while still synchronizing idle VMs [[#Foot2|2]].
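A simplified Python sketch of the multiple-tag idea follows. Each request receives a reservation tag (R), a limit tag (L) and a proportional-share tag (P), spaced by the inverse of the corresponding control, and the scheduler chooses which tag to consult at dispatch time. The class and function names are hypothetical, idle-VM synchronization and queue-depth handling are left out, and reservations are assumed to be greater than zero, so this should be read as an outline rather than the authors' implementation.

```python
from collections import deque


class VM:
    def __init__(self, name, reservation, limit, shares):
        self.name = name
        self.r, self.l, self.w = reservation, limit, shares  # assumes r > 0
        self.last = {"R": float("-inf"), "L": float("-inf"), "P": float("-inf")}
        self.queue = deque()  # pending requests, each a dict of tags

    def arrive(self, now):
        """Tag a new request: previous tag plus 1/control, but never before `now`."""
        tags = {
            "R": max(self.last["R"] + 1.0 / self.r, now),
            "L": max(self.last["L"] + 1.0 / self.l, now),
            "P": max(self.last["P"] + 1.0 / self.w, now),
        }
        self.last = dict(tags)
        self.queue.append(tags)


def schedule(vms, now):
    """Dispatch one request; return (vm name, phase) or None if nothing is eligible."""
    backlog = [vm for vm in vms if vm.queue]

    # Constraint-based phase: a reservation tag that is already due goes first.
    due = [vm for vm in backlog if vm.queue[0]["R"] <= now]
    if due:
        vm = min(due, key=lambda v: v.queue[0]["R"])
        vm.queue.popleft()
        return vm.name, "constraint"

    # Weight-based phase: among VMs still under their limit, smallest P tag wins.
    under_limit = [vm for vm in backlog if vm.queue[0]["L"] <= now]
    if under_limit:
        vm = min(under_limit, key=lambda v: v.queue[0]["P"])
        vm.queue.popleft()
        # The request counts towards the reservation too, so pull the VM's
        # remaining R tags forward (adjusting `last` as well is a choice
        # made in this sketch to keep later arrivals consistent).
        step = 1.0 / vm.r
        for request in vm.queue:
            request["R"] -= step
        vm.last["R"] -= step
        return vm.name, "weight"
    return None


a = VM("A", reservation=10, limit=float("inf"), shares=1)
b = VM("B", reservation=10, limit=float("inf"), shares=3)
for _ in range(4):
    a.arrive(0.0)
    b.arrive(0.0)
order = [schedule([a, b], 0.0) for _ in range(4)]
```

In this run both VMs first have one request served on the strength of their reservation tags, after which B, holding three times the shares of A, wins the next two dispatches in the weight-based phase.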

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. This modification took only around 200 lines of C code, which shows that it was not very difficult to implement mClock effectively in an existing product and to improve its performance. This suggests that mClock could be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective as a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of this algorithm is that it is able to meet those controls under varying capacity. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to work with multiple VMs on one host without constantly worrying about performance levels each time a VM is added.

The paper proposes a better and effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the way it presents calculations. In many sections, calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, and it was not really necessary to include them; the paper is understandable without them.
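Set out on its own, the quoted calculation is easier to follow. The sketch below assumes that the 300 in the denominator is the product of the mechanical delay and the peak transfer rate expressed in KB (taking 1 MB as 1000 KB); this is inferred from the quoted values rather than stated in the quotation.

```latex
\begin{align*}
T_m \times B_{peak} &= 5\,\text{ms} \times 60\,\text{MB/s} = 0.3\,\text{MB} = 300\,\text{KB} \\
\text{numerator} &= Lat_1 \left(1 + \frac{8\,\text{KB}}{300\,\text{KB}}\right) \approx 1.027\,Lat_1 \approx Lat_1
\end{align*}
```

Under that reading, an 8 KB request adds less than 3% to the latency term, which is why the quotation treats the numerator as approximately Lat1.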

In general, however, the paper is clear and not difficult to understand, and the case for the algorithm appears well presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:14:16Z

Naseido: /* Contribution */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. Three controls used by the hypervisor are: reservations, where the minimum bounds are set (in absolute units); limits, where the maximum upper bound on the allocation is set (again in absolute units); and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency. Hypervisors require a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, I/O resource allocation in modern hypervisors is very simple and somewhat primitive. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. SFQ(D) (Start-time Fair Queuing) performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency.

PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which uses all three controls in order to properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new, more efficient algorithm for the allocation of I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. The proposed algorithm, mClock, incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device increased, or another VM was added, the performance of all hosts would drop significantly, which made I/O management in hypervisors unreliable. In contrast, mClock is able to give VMs a guaranteed minimum reservation of I/O resources, so application performance should not drop below a set point. This provides much better application stability on each VM, and better overall performance and efficiency, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without the others knowing that “their” resources are being given to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use multiple tags based on the three controls and to decide dynamically which tag to use for scheduling, while still synchronizing idle VMs [[#Foot2|2]].

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. This modification took only around 200 lines of C code, which shows that it was not very difficult to implement mClock effectively in an existing product and to improve its performance. This suggests that mClock could be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective as a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of this algorithm is that it is able to meet those controls under varying capacity. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to work with multiple VMs on one host without constantly worrying about performance levels each time a VM is added.

The paper proposes a better and effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the way it presents calculations. In many sections, calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, and it was not really necessary to include them; the paper is understandable without them.

In general, however, the paper is clear and not difficult to understand, and the case for the algorithm appears well presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:08:01Z

Naseido: /* Problem Facing */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservations, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency. Hypervisors require a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, I/O resource allocation in modern hypervisors is very simple and somewhat primitive. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. SFQ(D) (Start-time Fair Queuing) performs fairly well for low-intensity workloads. However, as the number of VMs and the load they generate grow, so does the demand for performance and efficiency.

PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which uses all three controls in order to properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device increased, or another VM was added, the performance of all hosts would drop significantly, which made I/O management in hypervisors unreliable. In contrast, mClock was able to give VMs a guaranteed minimum reservation of I/O resources, so application performance should not drop below a set point. This provides much better application stability on each VM, and better overall performance and efficiency, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without the others knowing that “their” resources are being given to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients” [[#Foot2|2]].

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that VMs receive their minimum reserved service and no more than their upper limit in a time interval. The weight-based scheduler allocates the remaining throughput to achieve proportional sharing [[#Foot2|2]].

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. Changing the scheduling framework to use the mClock algorithm took only around 200 lines of C code. This shows that it did not take much to implement mClock effectively in an existing product and to improve its performance. This suggests that mClock could be very portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective as a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of this algorithm is that it is able to meet those controls under varying capacity. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to work with multiple VMs on one host without constantly worrying about performance levels each time a VM is added.

The paper proposes a better and effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the way it presents calculations. In many sections, calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, and it was not really necessary to include them; the paper is understandable without them.

In general, however, the paper is clear and not difficult to understand, and the case for the algorithm appears well presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:07:40Z

Naseido: /* Problem Facing */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, which set a minimum bound on a VM's allocation; limits, which set a maximum bound on the allocation; and shares, which allocate resources proportionally according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, so allocations must be adjusted dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. The term matters in this paper because the throughput available from the storage system varies over time, as does the number of I/O requests the VMs want completed. Therefore it is necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
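As a rough illustration of proportional sharing (not of SFQ itself, which works by tagging individual requests), the Python sketch below splits a given throughput by shares. The function name, VM names, and IOPS figures are made up for illustration. Nothing in it can express a reservation or a limit, which is the shortcoming described above.

```python
def proportional_allocation(total_iops, shares):
    """Split total_iops between VMs in proportion to their shares."""
    total_shares = sum(shares.values())
    return {vm: total_iops * s / total_shares for vm, s in shares.items()}


# Hypothetical VMs: vm_c holds twice the shares of the other two.
shares = {"vm_a": 1, "vm_b": 1, "vm_c": 2}

# Each VM's allocation follows whatever throughput happens to be available.
print(proportional_allocation(1200, shares))
print(proportional_allocation(600, shares))
```

When the available throughput halves, every VM's allocation halves with it, whatever its application actually needs.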

SFQ(D), a variant of Start-Time Fair Queuing [[#Foot2|2]] [[#Foot3|3]], performs fairly well for low-intensity workloads. However, as the number of VMs and their workloads grow, so does the need for better performance and efficiency. Hypervisors require a better resource-allocation algorithm in order to support high-performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, I/O resource allocation in modern hypervisors is very simple and somewhat primitive. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the hosts' I/O resource allocation relies on a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. SFQ(D) performs fairly well for low-intensity workloads, but as the number of VMs and their workloads grow, so does the need for better performance and efficiency.

PARDA allocates I/O resources to VMs proportional to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss, a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.
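The effect of adding VMs under an equal split can be seen with simple arithmetic. The sketch below uses hypothetical numbers (1000 IOPS of capacity, an application that needs 250 IOPS, and growth from three to five VMs), chosen only so that the drop works out to 40%; they are not measurements from the paper.

```python
def equal_split(total_iops, vm_count):
    """Throughput each VM receives when capacity is divided equally."""
    return total_iops / vm_count


CAPACITY_IOPS = 1000  # hypothetical capacity of the shared storage
REQUIRED_IOPS = 250   # hypothetical minimum an application needs

per_vm_before = equal_split(CAPACITY_IOPS, 3)  # three VMs share the device
per_vm_after = equal_split(CAPACITY_IOPS, 5)   # two more VMs are added
drop = 1 - per_vm_after / per_vm_before

print(f"before: {per_vm_before:.0f} IOPS, meets requirement: {per_vm_before >= REQUIRED_IOPS}")
print(f"after:  {per_vm_after:.0f} IOPS, meets requirement: {per_vm_after >= REQUIRED_IOPS}")
print(f"drop:   {drop:.0%}")
```

With no reservation in place, the application that was comfortably served before the new VMs arrived now falls below its minimum.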

====Solution====
We need an algorithm which uses all three controls (reservations, limits, and shares) in order to properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ [[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device increased, or when another VM was added, the performance of all hosts would drop significantly. These older methods therefore gave hypervisors unreliable I/O management. In contrast, mClock is able to give each VM a guaranteed minimum reservation of I/O resources. This means that application performance will not drop below a certain point, which provides much better application stability on each of the VMs, and better overall performance and efficiency, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade below the reserved level as each VM is added, because mClock reservations are met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met without exceeding its upper limit. It is then the responsibility of the weight-based scheduler to distribute the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]
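The multi-tag idea can be illustrated with a small simulation. The Python sketch below is a simplification written for this essay, not the paper's implementation: it assumes every VM always has a request waiting, keeps one reservation tag, one limit tag, and one share tag per VM, and leaves out the synchronization of idle clients. The class, function, and VM names and the IOPS figures are all hypothetical.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass
class VM:
    name: str
    reservation: float  # minimum IOPS (0 means none)
    limit: float        # maximum IOPS (INF means none)
    weight: float       # proportional share
    r_tag: float = 0.0  # time at which the next reserved request is due
    l_tag: float = 0.0  # earliest time the limit allows another request
    p_tag: float = 0.0  # share tag: smaller means less service so far
    served: int = 0

    def __post_init__(self):
        if self.reservation <= 0:
            self.r_tag = INF  # never picked by the reservation phase


def dispatch(vms, now):
    """Pick one VM to serve at time `now`, deciding which tag to use."""
    behind = [vm for vm in vms if vm.r_tag <= now]
    if behind:
        # A reservation is due: serve by reservation tag.
        vm = min(behind, key=lambda v: v.r_tag)
        vm.r_tag += 1.0 / vm.reservation
    else:
        # Otherwise serve by share tag among VMs still under their limit.
        eligible = [vm for vm in vms if vm.l_tag <= now]
        if not eligible:
            return None
        vm = min(eligible, key=lambda v: v.p_tag)
    vm.l_tag += 1.0 / vm.limit  # adds 0.0 when the limit is INF
    vm.p_tag += 1.0 / vm.weight
    vm.served += 1
    return vm


def simulate(vms, capacity_iops=1000, seconds=1):
    """Serve one request per time slot and count requests per VM."""
    for slot in range(capacity_iops * seconds):
        dispatch(vms, slot / capacity_iops)
    return {vm.name: vm.served for vm in vms}


vms = [
    VM("vm_a", reservation=400, limit=INF, weight=1),
    VM("vm_b", reservation=100, limit=150, weight=1),
    VM("vm_c", reservation=0, limit=INF, weight=2),
]
print(simulate(vms))
```

In this run vm_a is held at roughly its reservation, vm_b is stopped near its limit, and vm_c absorbs the remaining capacity, which is the kind of behaviour the three controls are meant to produce.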

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that VMs receive their minimum reserved service and no more than their upper limit in a time interval. The weight-based scheduler allocates the remaining throughput to achieve proportional sharing. [[#Foot2|2]]
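The division of labour between the two schedulers can be sketched as a static allocation: satisfy reservations first, then hand out what is left by shares without exceeding any limit. This is only an illustration of the three controls under assumed inputs. mClock itself works request by request using tags rather than computing allocations up front, and the function name and figures here are hypothetical.

```python
def allocate(capacity, vms):
    """vms maps name -> (reservation, limit, shares); returns name -> allocation."""
    # Constraint step: every VM starts with its reservation.
    alloc = {name: float(r) for name, (r, _, _) in vms.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    # Weight step: share out the remainder, never going past a limit.
    active = {name for name, (_, limit, _) in vms.items() if alloc[name] < limit}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[name][2] for name in active)
        capped = set()
        handed_out = 0.0
        for name in active:
            offer = remaining * vms[name][2] / total_shares
            room = vms[name][1] - alloc[name]
            give = min(offer, room)
            alloc[name] += give
            handed_out += give
            if give >= room:
                capped.add(name)  # this VM has reached its limit
        remaining -= handed_out
        active -= capped
        if not capped:
            break  # everything on offer was accepted
    return alloc


# Hypothetical VMs: (reservation, limit, shares), all in IOPS except shares.
example = {
    "vm_a": (250, 1000, 1),
    "vm_b": (100, 200, 1),
    "vm_c": (0, 1000, 2),
}
for name, iops in allocate(1000, example).items():
    print(f"{name}: {iops:.1f} IOPS")
```

Here vm_b is capped at its limit of 200 IOPS, and the throughput it cannot use is redistributed to vm_a and vm_c in proportion to their shares.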

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use the mClock algorithm. This shows that it did not take much to implement mClock effectively in an existing product and to improve its performance, and it suggests that mClock could be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is aimed at cluster-based storage systems, which are emerging as a cost-effective alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs in order to attain high performance. dmClock proved to be effective as a simple modification of the mClock algorithm that does not require complex synchronization between servers.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:06:32Z

Naseido: /* Problem Facing */

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:05:16Z

Naseido: /* Research problem */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasing significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performs fairly well for low-intensity workloads. However, as the workload with VMs multiplies, the constant need for faster performance and efficiency rises. Hypervisors require a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, we use a very primitive kind of I/O resource allocation in modern hypervisors. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is being used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts use a fair-scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]].SFQ(D) (Start-time Fair Queuing)2 3 performs fairly well for low-intensity workloads. However, as the workload with VMs multiplies, the constant need for faster performance and efficiency rises.

PARDA allocates I/O resources to VMs proportional to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem; whenever another VM is added or another background application is run on one of the VMs, all other VMs suffer a huge performance loss; a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, however as soon as the stress load on the shared storage device increases, the application might fail to run smoothly, or worse, crash.

====Solution====
We need an algorithm which uses all three controls in order to properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It has proposed a new and more efficient algorithm to allocate I/O resources. Older methods were limited as they only provided proportional shares, such as SFQ. mClock incorporates proportional shares, as well as a minimum reservation of I/O resources, and a maximum reservation.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ2 3 because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. The algorithm does not weaken the performance level as each VM gets added on, and mClock reservations are met. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines. What mClock attempts to achieve is combining a constraint-based scheduler and a weight-based scheduler. Making sure the minimum IO reservation limit is consistently met, yet not over the upper bound limit, would be handled by the constraint-based scheduler. Then, it is the responsibility of the weight-based scheduler to distribute the remaining IO throughput to the rest of the VMs equally.

The mClock algorithm uses a tag-based scheduler with some modifications; like the tag-based schedulers all I/O requests are assigned tags and scheduled in order of their tag values, the modifications includes the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. Constraint-based scheduler makes sure that VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

mClock was implemented on a modified version of VMware ESX server hypervisor [[#Foot4|4]] [[#Foot5|5]]. This modification only took around 200 lines of C code for the scheduling framework to use the mClock algorithm. This shows that it didn't take much to implement mClock effectively in a existing product, and to improve its performance results. This indicates that mClock could be very portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock or dmClock which basically runs an altered version of mClock at each server. dmClock is mainly used for cluster-based storage system which are rising as centralized disk arrays, and better than the alternates in terms of cost. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm which does not require complex synchronizations between servers.

==Critique==

The article introduces the mClock algorithm which handles (I/O resource allocation on) multiple VM in a variable throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of this algorithm, mClock, is that it is able to meet those controls in varying capacity. Also, it is significant that the algorithm was proven to be more efficient than existing methods at allocating IO resources in clustered architectures while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one host without the constant worry of performance levels, with each VM add-on.

The paper proposes a better, and effective alternative to SFQ and other older methods and the mClock algorithm efficiently handles multiple VMs in a throughput environment (LUN, PARDA).

The negative aspect of this paper is the writing style used for displaying calculations. In many sections of the paper, the calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, but it was not really necessary to include them; the paper is understandable without them.
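The arithmetic in the quoted line can be laid out step by step. This is only one reading of the quote, assuming that the 300 in the denominator is the product of the mechanical delay and the peak transfer rate expressed in KB (taking 1 MB = 1000 KB), and that the 8 is the reference I/O size in KB:

<math>T_m \cdot B_{peak} = 0.005\ \mathrm{s} \times 60000\ \mathrm{KB/s} = 300\ \mathrm{KB}</math>

<math>Lat_1 \cdot \left(1 + \frac{8}{300}\right) \approx 1.027 \cdot Lat_1 \approx Lat_1</math>

Under that reading, transferring an 8KB request adds less than 3% on top of the mechanical delay, which is why the numerator is approximated by Lat1.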

In general, however, the paper is clear and not difficult to understand. The case for this algorithm also appears well-presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T11:02:14Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
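A minimal sketch of share-based splitting may make the limitation concrete. The function name `proportional_split` and the throughput figures are hypothetical, and the code only illustrates proportional sharing in general, not SFQ's actual queuing mechanics:

```python
def proportional_split(capacity, shares):
    """Divide `capacity` among VMs in proportion to their shares.

    Hypothetical helper: there is no reservation or limit, so a VM's
    allocation depends entirely on who else is competing for the device.
    """
    total = sum(shares.values())
    return {vm: capacity * s / total for vm, s in shares.items()}


# Two VMs with equal shares on a device that sustains 1200 IOPS (made-up figure).
before = proportional_split(1200, {"vm1": 1, "vm2": 1})

# A third VM arrives: every existing VM's allocation shrinks.
after = proportional_split(1200, {"vm1": 1, "vm2": 1, "vm3": 1})
```

In this example vm1 falls from 600 to 400 IOPS as soon as vm3 is added, and nothing in the scheme can stop it from dropping further as more VMs arrive.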

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performs fairly well for low-intensity workloads. However, as the workload with VMs multiplies, the constant need for faster performance and efficiency rises. Hypervisors require a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed.

==Research problem==

====Problem Facing====
Today, modern hypervisors use a very primitive kind of I/O resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss, a 40% performance drop. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited because they only provided proportional shares. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that the minimum I/O reservation is consistently met without exceeding the upper limit. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]
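The basic idea of tag-based scheduling can be sketched with a single tag per request. The sketch below is an assumption-laden simplification: it uses only a share-based tag, treats all requests as already queued, and ignores idle clients, whereas mClock keeps several tags per request and chooses among them at dispatch time. The name `schedule_by_tags` and the example workload are hypothetical.

```python
import heapq


def schedule_by_tags(requests, shares):
    """Return the dispatch order for `requests` (a list of VM names).

    Each request gets a tag spaced 1/shares after the previous tag of the
    same VM, so a VM with more shares accumulates tags more slowly and is
    served more often. Hypothetical single-tag sketch, not mClock itself.
    """
    last_tag = {vm: 0.0 for vm in shares}
    heap = []
    for seq, vm in enumerate(requests):
        tag = last_tag[vm] + 1.0 / shares[vm]
        last_tag[vm] = tag
        # seq breaks ties so that equal tags are served in arrival order
        heapq.heappush(heap, (tag, seq, vm))

    order = []
    while heap:
        _, _, vm = heapq.heappop(heap)  # smallest tag is dispatched first
        order.append(vm)
    return order


# VM "A" holds twice the shares of VM "B".
order = schedule_by_tags(["A", "A", "A", "A", "B", "B"], {"A": 2, "B": 1})
# order == ["A", "A", "B", "A", "A", "B"]
```

Even though all of A's requests arrived first, B is not starved: the tags interleave the two VMs so that A is served twice for every request from B.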

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that VMs receive their minimum reserved service and no more than their upper limit in a time interval. The weight-based scheduler allocates the remaining throughput to achieve proportional sharing. [[#Foot2|2]]

mClock was implemented on a modified version of the VMware ESX server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use the mClock algorithm. This shows that mClock could be implemented in an existing product with little effort while improving its performance, and it suggests that mClock could be ported to other hypervisors just as easily.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is aimed at cluster-based storage systems, which are emerging as a cost-effective alternative to centralized disk arrays. The reservation handling in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of mClock is that it is able to meet those controls under varying capacity. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to work with multiple VMs on one host without constantly worrying that performance will degrade each time a VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the writing style used for displaying calculations. In many sections of the paper, the calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, but it was not really necessary to include them; the paper is understandable without them.

In general, however, the paper is clear and not difficult to understand. The case for this algorithm also appears well-presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:47:34Z

Naseido: /* Critique */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the workload with VMs multiplied, the constant need for faster performance and efficiency rose. Hypervisors required a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed, to aid hypervisors.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock is trying to achieve is to combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that the minimum I/O reservation is consistently met without exceeding the upper limit. All that is left is for the weight-based scheduler to distribute the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Today, modern hypervisors use a very primitive kind of I/O resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss, a 40% performance drop. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited because they only provided proportional shares. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that the minimum I/O reservation is consistently met without exceeding the upper limit. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that VMs receive their minimum reserved service and no more than their upper limit in a time interval. The weight-based scheduler allocates the remaining throughput to achieve proportional sharing. [[#Foot2|2]]

mClock was implemented on a modified version of the VMware ESX server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use the mClock algorithm. This shows that mClock could be implemented in an existing product with little effort while improving its performance, and it suggests that mClock could be ported to other hypervisors just as easily.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is aimed at cluster-based storage systems, which are emerging as a cost-effective alternative to centralized disk arrays. The reservation handling in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. A positive aspect of mClock is that it is able to meet those controls under varying capacity. It is also significant that the algorithm was shown to be more efficient than existing methods at allocating I/O resources in clustered architectures while providing greater isolation between VMs. mClock allows users to work with multiple VMs on one host without constantly worrying that performance will degrade each time a VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods, and the mClock algorithm efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

The negative aspect of this paper is the writing style used for displaying calculations. In many sections of the paper, the calculations are embedded in a sentence, which makes them difficult to read and understand. One example is the following line: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The calculations mostly describe how the algorithm computes the resource allocation, but it was not really necessary to include them; the paper is understandable without them.

In general, however, the paper is clear and not difficult to understand. The case for this algorithm also appears well-presented and valid.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:36:06Z

Naseido: /* Contribution */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the workload with VMs multiplied, the constant need for faster performance and efficiency rose. Hypervisors required a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed, to aid hypervisors.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock is trying to achieve is to combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that the minimum I/O reservation is consistently met without exceeding the upper limit. All that is left is for the weight-based scheduler to distribute the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Today, modern hypervisors use a very primitive kind of I/O resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss, a 40% performance drop. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM; however, as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited because they only provided proportional shares. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation had a significant performance disadvantage. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade as each VM is added, and mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines. mClock achieves this by combining a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that the minimum I/O reservation is consistently met without exceeding the upper limit. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm is a modified tag-based scheduler. As in other tag-based schedulers, every I/O request is assigned a tag and requests are scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]
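The interaction between the two schedulers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: every VM is assumed to always have requests waiting, idle-client synchronization is left out, and the class, function, and VM names are hypothetical. Each VM carries a reservation tag, a limit tag, and a share tag; a VM whose reservation tag is due is served first, and otherwise the slot goes to the VM with the smallest share tag among those still under their limit.

```python
class VM:
    """Per-VM controls plus the three tags of its next pending request."""

    def __init__(self, name, reservation, limit, weight):
        self.name = name
        self.reservation = reservation  # minimum IOPS, assumed > 0
        self.limit = limit              # maximum IOPS
        self.weight = weight            # proportional share
        self.r_tag = 0.0                # reservation tag
        self.l_tag = 0.0                # limit tag
        self.p_tag = 0.0                # proportional-share tag


def pick_next(vms, now):
    """Choose the VM whose request is served at time `now` (seconds)."""
    # Constraint-based phase: a VM whose reservation tag is due goes first.
    due = [vm for vm in vms if vm.r_tag <= now]
    if due:
        vm = min(due, key=lambda v: v.r_tag)
        vm.r_tag += 1.0 / vm.reservation
    else:
        # Weight-based phase: only VMs still under their limit are eligible.
        eligible = [vm for vm in vms if vm.l_tag <= now]
        if not eligible:
            return None
        vm = min(eligible, key=lambda v: v.p_tag)
        # The reservation tag is left alone here, so service received in
        # this phase does not count against the VM's reservation.
    # Every served request advances the limit and share tags.
    vm.l_tag += 1.0 / vm.limit
    vm.p_tag += 1.0 / vm.weight
    return vm


def simulate(vms, capacity, seconds=1):
    """Serve one request per slot, `capacity` slots per second."""
    served = {vm.name: 0 for vm in vms}
    for slot in range(capacity * seconds):
        vm = pick_next(vms, slot / capacity)
        if vm is not None:
            served[vm.name] += 1
    return served


if __name__ == "__main__":
    vms = [
        VM("db", reservation=500, limit=1000, weight=1),
        VM("web", reservation=100, limit=1000, weight=3),
        VM("batch", reservation=50, limit=200, weight=3),
    ]
    print(simulate(vms, capacity=1000))
```

With these made-up numbers, "db" has the smallest weight but is held up by its reservation, "batch" is held down by its limit, and "web" absorbs what is left, so the counts should land near 500, 300, and 200 requests.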

mClock was implemented in a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code to make the scheduling framework use the mClock algorithm. This shows that little work was needed to implement mClock in an existing product and improve its performance, and it suggests that mClock could be ported to other hypervisors relatively easily.

Another contribution is Distributed mClock, or dmClock, which runs a modified version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a more cost-effective alternative to centralized disk arrays. Reservations in this modified algorithm give preference to non-idle VMs in order to attain high performance. dmClock proved to be effective while remaining a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share, and mClock is able to meet all three controls even as capacity varies. The algorithm also proves to be more efficient than existing methods in clustered architectures, thanks to better resource allocation and greater isolation between VMs. mClock allows users to run multiple VMs on one host without constantly worrying that performance will degrade with each VM added.

The paper proposes mClock as a better and more effective alternative to SFQ and other older methods: an algorithm that efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weakness of the writing style is the way calculations are presented inline, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.
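For comparison, the same arithmetic can be written out step by step. This is only a reading of the quoted sentence: it assumes that the 300 in the denominator is the product of the mechanical delay and the peak transfer rate, expressed in KB, and that 1 MB = 1000 KB.

<math>T_m \cdot B_{peak} = 0.005\,\text{s} \times 60000\,\text{KB/s} = 300\,\text{KB}</math>

<math>Lat_1 \left(1 + \frac{8\,\text{KB}}{300\,\text{KB}}\right) \approx 1.027\,Lat_1 \approx Lat_1</math>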

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:35:00Z

Naseido: /* Contribution */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, which set a minimum bound on the allocation; limits, which set a maximum upper bound; and shares, which allocate resources proportionally according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003; the open issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. In addition, the available throughput can change over time, so allocations must be adjusted dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. The term matters in this paper because the available throughput varies over time, as does the number of jobs the system is asked to complete. It is therefore necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
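As a rough illustration of the proportional-sharing rule described above (only the allocation rule, not SFQ's tag-based mechanics), the sketch below splits a fixed throughput by shares. The function name, VM names, and numbers are hypothetical.

```python
def proportional_allocation(total_iops, shares):
    """Split total_iops among VMs in proportion to their shares."""
    total_shares = sum(shares.values())
    return {vm: total_iops * s / total_shares for vm, s in shares.items()}


if __name__ == "__main__":
    # Three VMs sharing 1200 IOPS with shares 1:2:3.
    print(proportional_allocation(1200, {"vm1": 1, "vm2": 2, "vm3": 3}))
    # A fourth VM joins; every existing VM's allocation is halved.
    print(proportional_allocation(1200, {"vm1": 1, "vm2": 2, "vm3": 3, "vm4": 6}))
```

Because nothing in this rule refers to a reservation or a limit, vm1 drops from 200 to 100 IOPS as soon as vm4 arrives, which is the behaviour the essay identifies as the weakness of share-only scheduling.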

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support high-performance VMs running concurrently; mClock is the answer proposed by Gulati, Merchant, and Varman.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Reservations continue to be met as more VMs are added, so a VM's performance does not fall below its guaranteed level. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of its own resources. As a result, performance can be increased for VMs that need it, without the other VMs being aware that “their” resources are being given to other machines.

In essence, mClock combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a very primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the hosts' own I/O resource allocation relies on a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, while each host uses a fair scheduler that divides its I/O share among its VMs. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs can suffer a large performance loss, as much as a 40% drop. This is unacceptable when applications have minimum performance requirements. An application with minimum resource requirements may run fine on a given VM, but as soon as the load on the shared storage device increases, it might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the limitations of I/O resource allocation in current hypervisors and proposes a new, more flexible algorithm for allocating I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. mClock supports proportional shares together with a minimum reservation and a maximum limit on each VM's I/O allocation.

Older methods of I/O resource allocation had a significant performance disadvantage: whenever the load on the shared storage device increased, or another VM was added, the performance of every VM on the host could drop significantly. These methods therefore could not give VMs predictable I/O performance. In contrast, mClock gives each VM a guaranteed minimum reservation of I/O resources, so an application's I/O throughput should not fall below its reserved level. This gives applications more stable performance on each VM, and better overall performance and efficiency, than older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Reservations continue to be met as more VMs are added, so a VM's performance does not fall below its guaranteed level. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of its own resources. As a result, performance can be increased for VMs that need it, without the other VMs being aware that “their” resources are being given to other machines. To achieve this, mClock combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm is a modified tag-based scheduler. As in other tag-based schedulers, every I/O request is assigned a tag and requests are scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

mClock was implemented in a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code to make the scheduling framework use the mClock algorithm. This shows that little work was needed to implement mClock in an existing product and improve its performance, and it suggests that mClock could be ported to other hypervisors relatively easily.

Another contribution is Distributed mClock, or dmClock, which runs a modified version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a more cost-effective alternative to centralized disk arrays. Reservations in this modified algorithm give preference to non-idle VMs in order to attain high performance. dmClock proved to be effective while remaining a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share, and mClock is able to meet all three controls even as capacity varies. The algorithm also proves to be more efficient than existing methods in clustered architectures, thanks to better resource allocation and greater isolation between VMs. mClock allows users to run multiple VMs on one host without constantly worrying that performance will degrade with each VM added.

The paper proposes mClock as a better and more effective alternative to SFQ and other older methods: an algorithm that efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weakness of the writing style is the way calculations are presented inline, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:32:03Z

Naseido: /* Contribution */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, which set a minimum bound on the allocation; limits, which set a maximum upper bound; and shares, which allocate resources proportionally according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003; the open issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. In addition, the available throughput can change over time, so allocations must be adjusted dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. The term matters in this paper because the available throughput varies over time, as does the number of jobs the system is asked to complete. It is therefore necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support high-performance VMs running concurrently; mClock is the answer proposed by Gulati, Merchant, and Varman.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Reservations continue to be met as more VMs are added, so a VM's performance does not fall below its guaranteed level. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of its own resources. As a result, performance can be increased for VMs that need it, without the other VMs being aware that “their” resources are being given to other machines.

In essence, mClock combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a very primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the hosts' own I/O resource allocation relies on a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, while each host uses a fair scheduler that divides its I/O share among its VMs. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs can suffer a large performance loss, as much as a 40% drop. This is unacceptable when applications have minimum performance requirements. An application with minimum resource requirements may run fine on a given VM, but as soon as the load on the shared storage device increases, it might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the limitations of I/O resource allocation in current hypervisors and proposes a new, more flexible algorithm for allocating I/O resources. Older methods, such as SFQ, were limited because they provided only proportional shares. mClock supports proportional shares together with a minimum reservation and a maximum limit on each VM's I/O allocation.

Older methods of I/O resource allocation had a significant performance disadvantage: whenever the load on the shared storage device increased, or another VM was added, the performance of every VM on the host could drop significantly. These methods therefore could not give VMs predictable I/O performance. In contrast, mClock gives each VM a guaranteed minimum reservation of I/O resources, so an application's I/O throughput should not fall below its reserved level. This gives applications more stable performance on each VM, and better overall performance and efficiency, than older methods such as SFQ.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Reservations continue to be met as more VMs are added, so a VM's performance does not fall below its guaranteed level. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of its own resources. As a result, performance can be increased for VMs that need it, without the other VMs being aware that “their” resources are being given to other machines. To achieve this, mClock combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler is then responsible for distributing the remaining I/O throughput among the VMs in proportion to their shares.

The mClock algorithm is a modified tag-based scheduler. As in other tag-based schedulers, every I/O request is assigned a tag and requests are scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

mClock was implemented in a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code to make the scheduling framework use the mClock algorithm. This shows that little work was needed to implement mClock in an existing product and improve its performance, and it suggests that mClock could be ported to other hypervisors relatively easily.

Another contribution is Distributed mClock, or dmClock, which runs a modified version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a more cost-effective alternative to centralized disk arrays. Reservations in this modified algorithm give preference to non-idle VMs in order to attain high performance. dmClock proved to be effective while remaining a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share, and mClock is able to meet all three controls even as capacity varies. The algorithm also proves to be more efficient than existing methods in clustered architectures, thanks to better resource allocation and greater isolation between VMs. mClock allows users to run multiple VMs on one host without constantly worrying that performance will degrade with each VM added.

The paper proposes mClock as a better and more effective alternative to SFQ and other older methods: an algorithm that efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weakness of the writing style is the way calculations are presented inline, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:17:38Z

Naseido: /* References */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, which set a minimum bound on the allocation; limits, which set a maximum upper bound; and shares, which allocate resources proportionally according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003; the open issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. In addition, the available throughput can change over time, so allocations must be adjusted dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. The term matters in this paper because the available throughput varies over time, as does the number of jobs the system is asked to complete. It is therefore necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support high-performance VMs running concurrently; mClock is the answer proposed by Gulati, Merchant, and Varman.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Reservations continue to be met as more VMs are added, so a VM's performance does not fall below its guaranteed level. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of its own resources. As a result, performance can be increased for VMs that need it, without the other VMs being aware that “their” resources are being given to other machines.

In essence, mClock combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that its upper limit is not exceeded. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a very primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the hosts' own I/O resource allocation relies on a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, while each host uses a fair scheduler that divides its I/O share among its VMs. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs can suffer a large performance loss, as much as a 40% drop. This is unacceptable when applications have minimum performance requirements. An application with minimum resource requirements may run fine on a given VM, but as soon as the load on the shared storage device increases, it might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm that can handle all three controls (reservations, limits, and shares) and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation suffered a severe performance loss. Whenever the load on the shared storage device increased, or when another VM was added, the performance of all VMs would drop significantly. These older methods therefore provided unreliable I/O management for hypervisors. In contrast, mClock is able to give VMs a guaranteed minimum reservation of I/O resources, which means that the I/O throughput available to an application will not drop below its reserved level. This provides much better application stability on each of the VMs, and better overall performance and efficiency, compared to older methods such as SFQ.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]
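
A heavily simplified sketch of how multiple tags might drive such a scheduler is given below. It is not the code from the paper. It assumes one pending request per VM, positive reservation, limit, and weight values, and a tag update of the form "previous tag plus the reciprocal of the control value, but never earlier than the current time". The names <code>VM</code>, <code>tag_request</code>, and <code>pick</code> are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    reservation: float  # minimum IOPS (assumed > 0)
    limit: float        # maximum IOPS (assumed > 0)
    weight: float       # shares (assumed > 0)
    r_tag: float = 0.0  # reservation tag
    l_tag: float = 0.0  # limit tag
    p_tag: float = 0.0  # proportional-share tag


def tag_request(vm, now):
    """Advance all three tags of a VM for a request arriving at time `now`."""
    vm.r_tag = max(vm.r_tag + 1.0 / vm.reservation, now)
    vm.l_tag = max(vm.l_tag + 1.0 / vm.limit, now)
    vm.p_tag = max(vm.p_tag + 1.0 / vm.weight, now)


def pick(vms, now):
    """Decide which tag to schedule by at time `now`."""
    # Constraint-based phase: a reservation tag that is already due wins.
    due = [v for v in vms if v.r_tag <= now]
    if due:
        return min(due, key=lambda v: v.r_tag), "reservation"
    # Weight-based phase: among VMs still under their limit,
    # serve the one with the smallest proportional-share tag.
    eligible = [v for v in vms if v.l_tag <= now]
    if eligible:
        return min(eligible, key=lambda v: v.p_tag), "shares"
    return None, "idle"


# Hypothetical VMs: A has a large reservation, B has a larger weight.
a = VM("A", reservation=100, limit=1000, weight=1)
b = VM("B", reservation=10, limit=1000, weight=3)
for vm in (a, b):
    tag_request(vm, now=0.0)
```

With these made-up values, <code>pick([a, b], 0.005)</code> serves B by shares because no reservation tag is due yet, while <code>pick([a, b], 0.02)</code> serves A because its reservation tag has fallen behind the clock.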

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use mClock's algorithm. This shows that it did not take much to implement mClock effectively in an existing product and improve its performance. It also suggests that mClock may be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. The good thing about this is that the algorithm proves to be more efficient than existing methods in clustered architectures, due to better resource allocation, while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one host, without constantly worrying about performance levels as each VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods: the mClock algorithm, which efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weak aspect of the writing style is the way calculations are embedded in running text, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.
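
For comparison, the quoted calculation could have been set out as display math. The version below is a reconstruction from the quoted values only, assuming that the 300 in the denominator is the product of the mechanical delay and the peak transfer rate expressed in KB (taking 1 MB as 1000 KB):

<math>T_m \cdot B_{peak} = 5\,\mathrm{ms} \times 60\,\mathrm{MB/s} = 0.3\,\mathrm{MB} = 300\,\mathrm{KB}</math>

<math>Lat_1 \left(1 + \frac{8\,\mathrm{KB}}{300\,\mathrm{KB}}\right) \approx 1.027 \cdot Lat_1 \approx Lat_1</math>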

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

6 A. S. Tanenbaum. Modern Operating Systems: Third Edition. Pearson Education, 2008.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:12:27Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling IO resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.
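
The weakness of pure proportional sharing can be seen in a few lines of Python. This is only an illustration of the idea, not SFQ itself; the function name <code>proportional_share</code> and all numbers are made up for the example.

```python
def proportional_share(capacity, shares):
    """Divide capacity among VMs strictly in proportion to their shares."""
    total = sum(shares.values())
    return {vm: capacity * s / total for vm, s in shares.items()}


# Two VMs with equal shares on a hypothetical 1000 IOPS device.
before = proportional_share(1000, {"vm1": 3, "vm2": 3})

# A third VM with more shares arrives; nothing protects vm1 and vm2.
after = proportional_share(1000, {"vm1": 3, "vm2": 3, "vm3": 4})
```

Here <code>vm1</code> falls from 500 to 300 IOPS as soon as <code>vm3</code> is added, because a share-only scheduler has no reservation to fall back on.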

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer that Gulati, Merchant, and Varman proposed.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is presented as a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all three controls (reservations, limits, and shares) in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, because mClock continues to meet every VM's reservation. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation of each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock basically tries to achieve is to combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that no VM exceeds its upper limit. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a fairly primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, but each host then uses a fair scheduler that simply divides the host's I/O resources among its VMs, with no reservation or limit. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM, but as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm that can handle all three controls (reservations, limits, and shares) and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation suffered a severe performance loss. Whenever the load on the shared storage device increased, or when another VM was added, the performance of all VMs would drop significantly. These older methods therefore provided unreliable I/O management for hypervisors. In contrast, mClock is able to give VMs a guaranteed minimum reservation of I/O resources, which means that the I/O throughput available to an application will not drop below its reserved level. This provides much better application stability on each of the VMs, and better overall performance and efficiency, compared to older methods such as SFQ.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use mClock's algorithm. This shows that it did not take much to implement mClock effectively in an existing product and improve its performance. It also suggests that mClock may be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. The good thing about this is that the algorithm proves to be more efficient than existing methods in clustered architectures, due to better resource allocation, while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one host, without constantly worrying about performance levels as each VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods: the mClock algorithm, which efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weak aspect of the writing style is the way calculations are embedded in running text, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:11:40Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from university students to large gaming firms. One of the key issues with virtual machines is ensuring that all shared resources on the machine are utilized equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''Throughput''' is the number of jobs per hour that a system completes. In general, a system is considered more efficient if it has a higher throughput [[#Foot6|6]]. In this paper, this term is used to discuss the fact that throughput varies, and the number of jobs a system wishes to complete varies as well. Therefore it is necessary to take throughput into account when scheduling I/O resources.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer that Gulati, Merchant, and Varman proposed.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is presented as a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all three controls (reservations, limits, and shares) in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, because mClock continues to meet every VM's reservation. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation of each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock basically tries to achieve is to combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that no VM exceeds its upper limit. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a fairly primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, but each host then uses a fair scheduler that simply divides the host's I/O resources among its VMs, with no reservation or limit. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM, but as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm that can handle all three controls (reservations, limits, and shares) and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation suffered a severe performance loss. Whenever the load on the shared storage device increased, or when another VM was added, the performance of all VMs would drop significantly. These older methods therefore provided unreliable I/O management for hypervisors. In contrast, mClock is able to give VMs a guaranteed minimum reservation of I/O resources, which means that the I/O throughput available to an application will not drop below its reserved level. This provides much better application stability on each of the VMs, and better overall performance and efficiency, compared to older methods such as SFQ.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

mClock was implemented on a modified version of the VMware ESX Server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use mClock's algorithm. This shows that it did not take much to implement mClock effectively in an existing product and improve its performance. It also suggests that mClock may be portable and easy to implement in other hypervisors as well.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a cheaper alternative to centralized disk arrays. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. The good thing about this is that the algorithm proves to be more efficient than existing methods in clustered architectures, due to better resource allocation, while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one host, without constantly worrying about performance levels as each VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods: the mClock algorithm, which efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weak aspect of the writing style is the way calculations are embedded in running text, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Displaying calculations this way makes them look messy and unorganized.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T10:03:31Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from large gaming firms to university students. One of the key issues with virtual machines is ensuring that all shared resources on the machine are shared equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing isolation to an extent, using resource management. The three controls used by the hypervisor are: reservation, where the minimum bounds are set; limits, where the maximum upper bound on the allocation is set; and shares, which proportionally allocate the resources according to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is IO resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''IOPS''', which stands for input/output operations per second, is the unit used in this paper to represent how much I/O resource a particular VM is allocated. It gives an idea of how quickly a storage request can be fulfilled by the storage system. Typically, the lower the IOPS, the longer the requesting VM has to wait.

'''SFQ''', or Start-Time Fair Queuing, is the traditional scheduler currently used to allocate resources. It follows a proportional-sharing algorithm which divides up the total throughput between the VMs in proportion to their assigned shares. The issue with this is that it does not consider reservations or limits in its allocation.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm to support many high-performance VMs running concurrently; mClock is the answer that Gulati, Merchant, and Varman proposed.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. It is presented as a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all three controls (reservations, limits, and shares) in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not collapse as each VM is added, because mClock continues to meet every VM's reservation. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation of each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock basically tries to achieve is to combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM's minimum I/O reservation is consistently met and that no VM exceeds its upper limit. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

==Research problem==

====Problem Facing====
Modern hypervisors still use a fairly primitive kind of I/O resource allocation. Currently, an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation on each host uses a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources in proportion to the number of I/O shares on each host, but each host then uses a fair scheduler that simply divides the host's I/O resources among its VMs, with no reservation or limit. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a large performance loss, a drop of 40%. This is unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on a given VM, but as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.

====Solution====
We need an algorithm that can handle all three controls (reservations, limits, and shares) and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It has proposed a new and more efficient algorithm to allocate I/O resources. Older methods were limited solely by providing proportional shares, such as SFQ. mClock incorporates proportional shares, as well as a minimum reservation of I/O resources, and a maximum reservation.

Older methods of I/O resource allocation had a terrible performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods; for instance, SFQ.

The mClock algorithm is a modified tag-based scheduler. As in other tag-based schedulers, every I/O request is assigned a tag, and requests are scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]
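A minimal sketch of per-request tagging with three tags follows. It assumes each tag advances by the inverse of its control (1/reservation, 1/limit, 1/weight) and is never allowed to fall behind the current time, which is a common way to build such tags; the class <code>VMState</code>, the helper names, and the numbers are hypothetical, and the sketch leaves out the idle-client synchronization mentioned in the quotation. It also assumes all three controls are positive.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VMState:
    reservation: float              # minimum IOPS
    limit: float                    # maximum IOPS
    weight: float                   # proportional share
    r_tag: Optional[float] = None   # reservation tag of the latest request
    l_tag: Optional[float] = None   # limit tag of the latest request
    p_tag: Optional[float] = None   # proportional-share tag of the latest request

def next_tag(previous, spacing, now):
    """Advance a tag by `spacing`, but never let it lag behind `now`."""
    if previous is None:
        return now
    return max(previous + spacing, now)

def tag_request(vm, now):
    """Assign the three tags for a request arriving at time `now` (seconds)."""
    vm.r_tag = next_tag(vm.r_tag, 1.0 / vm.reservation, now)
    vm.l_tag = next_tag(vm.l_tag, 1.0 / vm.limit, now)
    vm.p_tag = next_tag(vm.p_tag, 1.0 / vm.weight, now)
    return vm.r_tag, vm.l_tag, vm.p_tag

vm = VMState(reservation=100.0, limit=200.0, weight=2.0)
first = tag_request(vm, now=0.0)    # first request: all tags start at `now`
second = tag_request(vm, now=0.0)   # backlogged request: tags spaced by 1/control
late = tag_request(vm, now=5.0)     # after an idle period: tags jump to `now`
print(first, second, late)
```

Taking the maximum with the current time means a VM that has been idle does not accumulate credit that it could later spend in a burst.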

mClock also combines a constraint-based scheduler with a weight-based scheduler. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval”, while the weight-based scheduler “allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]
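The interplay of the two schedulers can be sketched as a two-phase choice over the head-of-queue request of each VM. The sketch assumes every pending request already carries a reservation tag, a limit tag, and a proportional-share tag; the type <code>Head</code>, the function <code>pick_next</code>, and the tag values are hypothetical. It is a simplification: any bookkeeping a real scheduler performs after dispatching a request is omitted.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Head(NamedTuple):
    vm: str
    r_tag: float  # reservation tag
    l_tag: float  # limit tag
    p_tag: float  # proportional-share tag

def pick_next(heads: Sequence[Head], now: float) -> Optional[Tuple[str, str]]:
    """Return (vm, phase) for the request to dispatch at time `now`, or None."""
    # Constraint-based phase: a VM whose reservation tag is due goes first.
    due = [h for h in heads if h.r_tag <= now]
    if due:
        return min(due, key=lambda h: h.r_tag).vm, "constraint"
    # Weight-based phase: among VMs still under their limit,
    # serve the one with the smallest proportional-share tag.
    eligible = [h for h in heads if h.l_tag <= now]
    if eligible:
        return min(eligible, key=lambda h: h.p_tag).vm, "weight"
    # Every pending VM has reached its limit for now.
    return None

a = Head("A", r_tag=0.9, l_tag=0.5, p_tag=3.0)
b = Head("B", r_tag=1.5, l_tag=0.8, p_tag=2.0)
c = Head("C", r_tag=2.0, l_tag=1.2, p_tag=1.0)

choice_all = pick_next([a, b, c], now=1.0)   # A is behind on its reservation
choice_rest = pick_next([b, c], now=1.0)     # C is over its limit, so B runs
choice_capped = pick_next([c], now=1.0)      # nothing may be dispatched yet
print(choice_all, choice_rest, choice_capped)
```

Note that C has the smallest proportional-share tag, yet it is skipped in the second call because its limit tag lies in the future; the limit takes precedence over the share.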

mClock was implemented on a modified version of the VMware ESX server hypervisor [[#Foot4|4]] [[#Foot5|5]]. The modification took only around 200 lines of C code for the scheduling framework to use mClock's algorithms. This shows that it did not take much to implement mClock effectively in an existing product and improve its performance. It also suggests that mClock could be quite portable and easy to implement in other hypervisors.

Another contribution was the introduction of Distributed mClock, or dmClock, which runs an altered version of mClock at each server. dmClock is aimed at cluster-based storage systems, which are emerging as a more cost-effective alternative to centralized disk arrays. The reservation mechanism in this modified algorithm gives higher preference to non-idle VMs in order to attain high performance. dmClock proved to be effective using a simple, modified mClock algorithm that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share, and mClock is able to meet those controls under varying capacity. A strength of the work is that the algorithm proves more efficient than existing methods in clustered architectures, thanks to better resource allocation and greater isolation between VMs. mClock lets users work with multiple VMs on one host without constantly worrying about performance levels as each VM is added.

The paper proposes a better and more effective alternative to SFQ and other older methods: the mClock algorithm, which efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weakness is the writing style used for calculations. For example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Presenting calculations inline in this way is messy and hard to follow.
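For comparison, the arithmetic in the quoted passage can be laid out step by step. The symbol <math>S</math> for the reference I/O size is introduced here only for readability, and the reading of 8/300 as the I/O size divided by the product of mechanical delay and peak transfer rate is inferred from the quoted values (taking 1 MB as 1000 KB) rather than stated in the quotation.

<math>T_m \cdot B_{peak} = 5\,\text{ms} \times 60\,\text{MB/s} = 0.3\,\text{MB} = 300\,\text{KB}</math>

<math>Lat_1 \left(1 + \frac{S}{T_m \cdot B_{peak}}\right) = Lat_1 \left(1 + \frac{8}{300}\right) \approx 1.027 \, Lat_1 \approx Lat_1</math>

The correction term is under 3%, which is why the quoted passage treats the numerator as approximately <math>Lat_1</math>.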

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In (FAST ’09) Proceedings of the Seventh Usenix Conference on File and Storage Technologies, Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

4 VMware ESX Server User Manual, December 2007. VMware Inc.

5 VMware, Inc. Introduction to VMware Infrastructure. 2007. http://www.vmware.com/support/pubs/.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T09:57:35Z

Naseido: /* Background Concepts */

==Background Concepts==

'''Virtual machines''' (VMs) are becoming increasingly significant as they are used by everyone from large gaming firms to university students. One of the key issues with virtual machines is ensuring that all resources on the machine are shared equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

'''Hypervisors''' are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, where the minimum bound is set; limits, where the upper bound on the allocation is set; and shares, which allocate resources in proportion to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

'''IOPS''', which stands for input/output operations per second, is the unit used in this paper to represent how much I/O resource a particular VM is allocated. It gives an idea of how quickly the storage system can fulfil storage requests. Typically, the lower the IOPS, the longer the requesting VM has to wait.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the number of VMs and their workloads grew, so did the need for better performance and efficiency. Hypervisors required a better resource-allocation algorithm in order to support many high-performance VMs running concurrently; mClock was the answer proposed by Gulati, Merchant, and Varman.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. The paper presents it as a better alternative to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance does not degrade below the reserved level as VMs are added, because mClock reservations are still met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM keep the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without the others knowing that “their” resources are being distributed to other machines.

In essence, mClock combines a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler ensures that each VM's minimum I/O reservation is consistently met while its upper limit is not exceeded. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T08:13:28Z

Naseido: /* Background Concepts */

==Background Concepts==

'''IOPS''', which stands for input/output operations per second, is the unit used to represent how much I/O resource a particular VM is allocated, as it gives an idea of how quickly the storage system can fulfil a storage request. Typically, the lower the IOPS, the longer the requesting server has to wait.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T06:08:13Z

Naseido: /* Background Concepts */

==Background Concepts==

Virtual machine (VM) usage is increasing significantly, and VMs are now used by everyone from large gaming firms to university students. One of the key issues with virtual machines is ensuring that all resources on the machine are shared equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, where the minimum bound is set; limits, where the upper bound on the allocation is set; and shares, which allocate resources in proportion to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. However, the current issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. Also, the available throughput can change with time, and adjustments to allocations must be made dynamically.

SFQ(D) (Start-time Fair Queuing)[[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as the workload with VMs multiplied, the constant need for faster performance and efficiency rose. Hypervisors required a better resource-allocation algorithm in order to meet the need for high performance VMs running concurrently; mClock was the answer Gulati, Varman, and Merchant proposed, to aid hypervisors.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Evidently, it is the better alternate to SFQ[[#Foot2|2]] [[#Foot3|3]] because it supports all controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. The algorithm does not weaken the performance level as each VM gets added on, and mClock reservations are met. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock is basically trying to achieve is to combine a constraint-based scheduler and a weight-based scheduler. Making sure the minimum IO reservation limit is consistently met, yet not over the upper bound limit, would be handled by the constraint-based scheduler. All thats left is that the weight-based scheduler distribute the remaining IO throughput to the rest of the VMs equally.

==Research problem==

====Problem Facing====
Today, we use a very primitive kind of I/O resource allocation in modern hypervisors. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is being used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the I/O resource allocation algorithm of the hosts use a fair-scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. What this means is that PARDA allocates I/O resources to VMs proportional to the number of I/O shares on the host, but each host uses a fair scheduler which divides the I/O shares amongst the VMs equally. This leads to the main problem; whenever another VM is added or another background application is run on one of the VMs, all other VMs suffer a huge performance loss; a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, however as soon as the stress load on the shared storage device increases, the application might fail to run smoothly, or worse, crash.

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It has proposed a new and more efficient algorithm to allocate I/O resources. Older methods were limited solely by providing proportional shares, such as SFQ. mClock incorporates proportional shares, as well as a minimum reservation of I/O resources, and a maximum reservation.

Older methods of I/O resource allocation had a terrible performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. Also, these older methods provided unreliable I/O management of hypervisors. Conversely, mClock was able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance and efficiency level, compared to older methods; for instance, SFQ.

The mClock algorithm uses a tag-based scheduler with some modifications; like the tag-based schedulers all I/O requests are assigned tags and scheduled in order of their tag values, the modifications includes the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients”. [[#Foot2|2]]

mClock also uses both constraint-based and weight-based schedulers. Constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing”. [[#Foot2|2]]

Another contribution was the introduction of Distributed mClock or dmClock which basically runs an altered version of mClock at each server. dmClock is mainly used for cluster-based storage system which are rising as centralized disk arrays, and better than the alternates in terms of cost. The reservation in this modified algorithm gives higher preference to non-idle VMs to attain high performance. dmClock proved to be effective with a simple, modified mClock algorithm which does not require complex synchronizations between servers.

==Critique==

The article introduces the mClock algorithm which handles (I/O resource allocation on) multiple VM in a variable throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. This algorithm, mClock, is able to meet those controls in varying capacity. The good thing about this is that the algorithm proves to be (more efficient compare to existing methods) efficient in clustered architectures due to better resource allocation while providing greater isolation between VMs. mClock allows users to be comfortable when working with multiple VMs on one HOST without the constant worry of performance levels, with each VM add-on.

The paper proposes a better, and effective alternative to SFQ and other older methods; the mClock algorithm which efficiently handles multiple VMs in a throughput environment (LUN, PARDA).

One aspect of the writing style , "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. The style of displaying these calculations depicts a messy, unorganized styled.<math></math>

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

COMP 3000 Essay 2 2010 Question 10

2010-12-03T05:56:50Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor I/O Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor I/O Scheduling]

'''Authors''':

Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

Virtual machines (VMs) are used more and more widely, by everyone from large gaming firms to university students. One of the key issues with virtual machines is ensuring that all resources on the machine are shared equitably. In order to do this, and to provide the illusion that the virtual machine is running on its own hardware, a hypervisor is required.

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used by the hypervisor are: reservations, where the minimum bounds are set; limits, where the maximum bound on the allocation is set; and shares, which allocate resources in proportion to the weight of each VM. These three controls have been supported for CPU and memory resource allocation since 2003. The open issue is I/O resource allocation. Currently, when more VMs are added to a host, the contention for input/output (I/O) resources can suddenly lower a VM’s allocation. In addition, the available throughput can change with time, so adjustments to allocations must be made dynamically.

SFQ(D) (Start-time Fair Queuing) [[#Foot2|2]] [[#Foot3|3]] performed fairly well for low-intensity workloads. However, as VM workloads multiplied, so did the demand for faster performance and greater efficiency. Hypervisors required a better resource-allocation algorithm in order to support many high-performance VMs running concurrently; mClock is the answer proposed by Gulati, Varman, and Merchant.

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. The three controls used are the reservation, where the minimum bound is set; the limit, where the maximum bound on the allocation is set; and shares, which allocate resources in proportion to each VM’s weight, subject to the reservation and limit. However, contention for input/output (I/O) resources can suddenly lower a VM’s allocation; the available throughput can change with time, and adjustments to allocations must be made dynamically.

mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. The paper presents it as a better alternative to SFQ [[#Foot2|2]] [[#Foot3|3]] because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. Performance guarantees do not weaken as each VM is added, and mClock reservations are met. Essentially, mClock dynamically adjusts the proportion of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

What mClock essentially does is combine a constraint-based scheduler and a weight-based scheduler. The constraint-based scheduler makes sure that each VM’s minimum I/O reservation is consistently met without exceeding its upper limit. The weight-based scheduler then distributes the remaining I/O throughput among the VMs in proportion to their shares.
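The two-phase idea can be illustrated with a minimal sketch. This is a simplified reading of the paragraph above (reservations first, then the remaining throughput divided by shares and capped at each limit), not the paper's actual tag-based mechanism. The function name <code>allocate</code>, the tuple layout, and the example numbers are all hypothetical.

```python
def allocate(capacity, vms):
    """vms maps a VM name to (reservation, limit, shares), all in IOPS.

    Returns a dict mapping each VM name to its allocation.
    """
    # Phase 1 (constraint-based): every VM first receives its reservation.
    alloc = {name: float(r) for name, (r, _, _) in vms.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    # Phase 2 (weight-based): hand out the rest in proportion to shares,
    # never pushing a VM past its limit. Capped VMs drop out and their
    # unused portion is redistributed in the next round.
    active = {n for n, (_, limit, _) in vms.items() if alloc[n] < limit}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[n][2] for n in active)
        if total_shares == 0:
            break
        spent = 0.0
        for n in list(active):
            _, limit, shares = vms[n]
            grant = min(remaining * shares / total_shares, limit - alloc[n])
            alloc[n] += grant
            spent += grant
            if limit - alloc[n] <= 1e-9:
                active.discard(n)
        remaining -= spent
    return alloc


example = {
    "A": (100, 1000, 1),
    "B": (200, 300, 2),
    "C": (0, 1000, 1),
}
print(allocate(1000, example))
```

In this example B reaches its limit of 300 in the first round, and the throughput it cannot use is split between A and C according to their shares.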

==Research problem==

====Problem Facing====
Today, modern hypervisors use a very primitive kind of I/O resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate I/O resources to each VM running on a particular storage device. Unfortunately, the hosts themselves use a fair scheduler called SFQ [[#Foot2|2]] [[#Foot3|3]]. This means that PARDA allocates I/O resources to VMs in proportion to the number of I/O shares on the host, but each host uses a fair scheduler that divides the I/O shares amongst its VMs equally. This leads to the main problem: whenever another VM is added, or another background application is run on one of the VMs, all other VMs suffer a huge performance loss, a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, but as soon as the load on the shared storage device increases, the application might fail to run smoothly or, worse, crash.
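A small sketch shows why equal division hurts existing VMs when a new one arrives. It assumes a fixed device capacity split evenly among VMs; the function names are hypothetical and the numbers are illustrative only, so it does not attempt to reproduce the 40% figure reported above.

```python
def equal_share(capacity_iops, n_vms):
    """Throughput each VM receives when capacity is divided equally."""
    return capacity_iops / n_vms


def relative_drop(n_vms_before):
    """Fractional loss per VM when one more VM joins an equally shared device."""
    before = equal_share(1.0, n_vms_before)
    after = equal_share(1.0, n_vms_before + 1)
    return (before - after) / before


if __name__ == "__main__":
    for n in (1, 2, 4):
        pct = round(relative_drop(n) * 100, 1)
        print(f"{n} -> {n + 1} VMs: {pct}% drop per VM")
```

Without a reservation there is no floor under this drop, which is the gap that a guaranteed minimum is meant to close.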

====Solution====
We need an algorithm which can handle all kinds of controls and properly allocate resources for each request. To resolve this issue of resource allocation and performance, mClock is introduced and tested against SFQ[[#Foot3|3]].

==Contribution==

This paper addresses the current limitations of I/O resource allocation for hypervisors. It proposes a new and more efficient algorithm to allocate I/O resources. Older methods, such as SFQ, were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of I/O resources and a maximum limit.

Older methods of I/O resource allocation suffered from severe performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop significantly. These older methods also provided unreliable I/O management for hypervisors. By contrast, mClock is able to present VMs with a guaranteed minimum reservation of I/O resources. This means that application performance will never drop below a certain point, which provides much better application stability on each of the VMs, and better overall performance and efficiency, compared to older methods such as SFQ.

The mClock algorithm uses a tag-based scheduler with some modifications. As in other tag-based schedulers, all I/O requests are assigned tags and scheduled in order of their tag values. The modifications include the ability to use “multiple tags based on three controls and dynamically decide which tag to use for scheduling, while still synchronizing idle clients” [[#Foot2|2]].
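The idea of keeping several tags per VM and choosing between them at dispatch time can be sketched as follows. This is a loose illustration only: the update rule (advance each tag by the inverse of its control, but never behind the current time) and the selection order are assumptions of this sketch rather than a faithful copy of the paper's algorithm, and the class and function names are hypothetical. It also assumes every reservation, limit, and weight is greater than zero.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    reservation: float  # minimum IOPS, assumed > 0
    limit: float        # maximum IOPS, assumed > 0
    weight: float       # share, assumed > 0
    r_tag: float = 0.0  # reservation tag
    l_tag: float = 0.0  # limit tag
    p_tag: float = 0.0  # proportional-share tag


def assign_tags(vm, now):
    """Advance all three tags for the VM's next request."""
    vm.r_tag = max(vm.r_tag + 1.0 / vm.reservation, now)
    vm.l_tag = max(vm.l_tag + 1.0 / vm.limit, now)
    vm.p_tag = max(vm.p_tag + 1.0 / vm.weight, now)


def pick(vms, now):
    """Choose which VM to serve next and say which tag decided it."""
    # Constraint-based phase: a reservation tag that is already due wins.
    due = [vm for vm in vms if vm.r_tag <= now]
    if due:
        return min(due, key=lambda vm: vm.r_tag), "reservation"
    # Weight-based phase: among VMs still under their limit,
    # serve the one with the smallest share tag.
    eligible = [vm for vm in vms if vm.l_tag <= now]
    if eligible:
        return min(eligible, key=lambda vm: vm.p_tag), "shares"
    return None, "idle"
```

Keeping the three tags separate is what lets a single scheduler switch between guaranteeing minimums and dividing spare throughput by weight.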

mClock also uses both constraint-based and weight-based schedulers. The constraint-based scheduler makes sure that “VMs receive their minimum reserved service and no more than their upper limit in a time interval. Weight-based scheduler allocates the remaining throughput to achieve proportional sharing” [[#Foot2|2]].

Another contribution is Distributed mClock (dmClock), which runs a modified version of mClock at each server. dmClock is intended for cluster-based storage systems, which are emerging as a lower-cost alternative to centralized disk arrays. The reservation mechanism in the modified algorithm gives higher preference to non-idle VMs in order to attain high performance. dmClock proved effective while remaining a simple modification of mClock that does not require complex synchronization between servers.

==Critique==

The article introduces the mClock algorithm, which handles I/O resource allocation for multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. A strength of the algorithm is that it proves more efficient than existing methods in clustered architectures, thanks to better resource allocation, while providing greater isolation between VMs. mClock lets users run multiple VMs on one host without constantly worrying that performance will degrade with each added VM.

The paper proposes a more effective alternative to SFQ and other older methods: the mClock algorithm, which efficiently handles multiple VMs in a variable-throughput environment (LUN, PARDA).

One weakness of the writing style is the way calculations are embedded in sentences, for example: "For a small reference I/O size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1" [[#Foot2|2]]. Presenting calculations inline like this makes them look messy and unorganized.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

3 P. Goyal, H. M. Vin, and H. Cheng. Start-Time Fair Queuing: A scheduling algorithm for integrated services packet switching networks. Technical Report CS-TR-96-02, UT Austin, January 1996.

COMP 3000 Essay 2 2010 Question 10

2010-12-02T00:35:40Z

Naseido: /* Critique */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor IO Scheduling]

'''Authors''':

: Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

: Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

: Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. Three controls are used: the reservation, which sets a minimum bound; the limit, which sets a maximum bound on the allocation; and shares, which allocate resources in proportion to each VM’s weight, subject to the reservation and limit. This matters because virtualization has been very successful; people are comfortable putting multiple VMs on one host without worrying about how each VM affects the performance of the others. However, contention for I/O resources can suddenly lower a VM’s allocation; the available throughput can change with time, and adjustments to allocations must be made dynamically. mClock is a better alternative because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. It enforces a limit on each VM’s allocation, its guarantees do not weaken as VMs are added, and mClock reservations are met.


mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

==Research problem==
Today, modern hypervisors use a very primitive kind of IO resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate IO resources to each VM running on a particular storage device. Unfortunately, the hosts themselves use a fair scheduler called SFQ (Start-time Fair Queuing) [[#Foot2|2]]. This means that PARDA allocates IO resources to VMs in proportion to the number of IO shares on the host, but each host uses a fair scheduler that divides the IO shares amongst its VMs equally. As a result, whenever another VM is added, or another background application is run on one of the VMs, all the other VMs suffer a huge performance loss. In the case of adding another VM, there is a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, but as soon as the load on the shared storage device increases, the application would run poorly, or could potentially crash.

==Contribution==
This paper addresses the current limitations of IO resource allocation for hypervisors. The paper proposes a new and more efficient algorithm to allocate IO resources. Older methods were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of IO resources and a maximum limit.

Older methods of IO resource allocation suffered from severe performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop considerably. Older methods provided unreliable IO management for hypervisors.

mClock was able to present VMs with a guaranteed minimum reservation of IO resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance.

"dmClock (used for cluster-based storage systems) runs a modified version of mClock at each server. There is only one modification to the algorithm to account for the distributed model in the Tag-Assignment component." - from the paper

==Critique==
The article introduces the mClock algorithm which handles multiple VMs in a variable throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. This algorithm, mClock, is able to meet those controls in varying capacity.

The good thing about this is that the algorithm proves efficient in clustered architectures. Moreover, it provides greater isolation between VMs.

In this paper there were many terms that were used but never explained, such as orders (used in the graphs), LUN, PARDA, etc.
One aspect of the writing style that I disliked was the way calculations were written into sentences, for example: "For a small reference IO size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1". To me this was very messy and made me skip through the calculations part of the sentence.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

COMP 3000 Essay 2 2010 Question 10

2010-12-02T00:24:06Z

Naseido: /* Critique */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor IO Scheduling]

'''Authors''':

: Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

: Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

: Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. Three controls are used: the reservation, which sets a minimum bound; the limit, which sets a maximum bound on the allocation; and shares, which allocate resources in proportion to each VM’s weight, subject to the reservation and limit. This matters because virtualization has been very successful; people are comfortable putting multiple VMs on one host without worrying about how each VM affects the performance of the others. However, contention for I/O resources can suddenly lower a VM’s allocation; the available throughput can change with time, and adjustments to allocations must be made dynamically. mClock is a better alternative because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. It enforces a limit on each VM’s allocation, its guarantees do not weaken as VMs are added, and mClock reservations are met.


mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

==Research problem==
Today, modern hypervisors use a very primitive kind of IO resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate IO resources to each VM running on a particular storage device. Unfortunately, the hosts themselves use a fair scheduler called SFQ (Start-time Fair Queuing) [[#Foot2|2]]. This means that PARDA allocates IO resources to VMs in proportion to the number of IO shares on the host, but each host uses a fair scheduler that divides the IO shares amongst its VMs equally. As a result, whenever another VM is added, or another background application is run on one of the VMs, all the other VMs suffer a huge performance loss. In the case of adding another VM, there is a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, but as soon as the load on the shared storage device increases, the application would run poorly, or could potentially crash.

==Contribution==
This paper addresses the current limitations of IO resource allocation for hypervisors. The paper proposes a new and more efficient algorithm to allocate IO resources. Older methods were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of IO resources and a maximum limit.

Older methods of IO resource allocation suffered from severe performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop considerably. Older methods provided unreliable IO management for hypervisors.

mClock was able to present VMs with a guaranteed minimum reservation of IO resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance.

"dmClock (used for cluster-based storage systems) runs a modified version of mClock at each server. There is only one modification to the algorithm to account for the distributed model in the Tag-Assignment component." - from the paper

==Critique==
The article introduces the mClock algorithm which handles multiple VMs in a variable throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. This algorithm, mClock, is able to meet those controls in varying capacity.

The good thing about this is that the algorithm proves efficient in clustered architectures. Moreover, it provides greater isolation between VMs.

In this paper there were many terms that were used but never explained, such as orders (used in the graphs), LUN, PARDA, etc. Also, I did not like the way the calculations were written in sentences, for example: "For a small reference IO size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1". To me this was very messy and made me skip through the calculations part of the sentence.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

COMP 3000 Essay 2 2010 Question 10

2010-12-01T23:26:50Z

Naseido: /* Background Concepts */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor IO Scheduling]

'''Authors''':

: Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

: Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

: Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==

Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. Three controls are used: the reservation, which sets a minimum bound; the limit, which sets a maximum bound on the allocation; and shares, which allocate resources in proportion to each VM’s weight, subject to the reservation and limit. This matters because virtualization has been very successful; people are comfortable putting multiple VMs on one host without worrying about how each VM affects the performance of the others. However, contention for I/O resources can suddenly lower a VM’s allocation; the available throughput can change with time, and adjustments to allocations must be made dynamically. mClock is a better alternative because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. It enforces a limit on each VM’s allocation, its guarantees do not weaken as VMs are added, and mClock reservations are met.


mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

==Research problem==
Today, modern hypervisors use a very primitive kind of IO resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate IO resources to each VM running on a particular storage device. Unfortunately, the hosts themselves use a fair scheduler called SFQ (Start-time Fair Queuing) [[#Foot2|2]]. This means that PARDA allocates IO resources to VMs in proportion to the number of IO shares on the host, but each host uses a fair scheduler that divides the IO shares amongst its VMs equally. As a result, whenever another VM is added, or another background application is run on one of the VMs, all the other VMs suffer a huge performance loss. In the case of adding another VM, there is a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, but as soon as the load on the shared storage device increases, the application would run poorly, or could potentially crash.

==Contribution==
This paper addresses the current limitations of IO resource allocation for hypervisors. The paper proposes a new and more efficient algorithm to allocate IO resources. Older methods were limited to providing proportional shares only. mClock incorporates proportional shares as well as a minimum reservation of IO resources and a maximum limit.

Older methods of IO resource allocation suffered from severe performance loss. Whenever the load on the shared storage device was increased, or when another VM was added, the performance of all hosts would drop considerably. Older methods provided unreliable IO management for hypervisors.

mClock was able to present VMs with a guaranteed minimum reservation of IO resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance.

"dmClock (used for cluster-based storage systems) runs a modified version of mClock at each server. There is only one modification to the algorithm to account for the distributed model in the Tag-Assignment component." - from the paper

==Critique==
The article introduces the mClock algorithm, which handles multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. A strength of the algorithm is that it proves efficient in clustered architectures. Moreover, it provides greater isolation between VMs.

In this paper there were many terms that were used but never explained, such as orders (used in the graphs), LUN, PARDA, etc. Also, I did not like the way the calculations were written in sentences, "For a small reference IO size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1". To me this was very messy and made me skip through the calculations part of the sentence.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

COMP 3000 Essay 2 2010 Question 10

2010-12-01T17:13:52Z

Naseido: /* Critique */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

==Paper==

[http://www.usenix.org/events/osdi10/tech/full_papers/Gulati.pdf mClock: Handling Throughput Variability for Hypervisor IO Scheduling]

'''Authors''':

: Ajay Gulati VMware Inc. Palo Alto, CA, 94304 agulati@vmware.com

: Arif Merchant HP Labs Palo Alto, CA 94304 arif.merchant@acm.org

: Peter J. Varman Rice University Houston, TX, 77005 pjv@rice.edu

==Background Concepts==
Hypervisors are responsible for multiplexing hardware resources between virtual machines while providing a degree of isolation through resource management. Three controls are used: the reservation, which sets a minimum bound; the limit, which sets a maximum bound on the allocation; and shares, which allocate resources in proportion to each VM’s weight, subject to the reservation and limit. This matters because virtualization has been very successful; people are comfortable putting multiple VMs on one host without worrying about how each VM affects the performance of the others. However, contention for I/O resources can suddenly lower a VM’s allocation; the available throughput can change with time, and adjustments to allocations must be made dynamically. mClock is a better alternative because it supports all three controls in a single algorithm, handles variable and unknown capacity, and is fast to compute. It enforces a limit on each VM’s allocation, its guarantees do not weaken as VMs are added, and mClock reservations are met.


mClock is a resource-allocation algorithm that helps hypervisors manage I/O requests from multiple virtual machines simultaneously. Essentially, mClock dynamically adjusts the proportions of resources each VM receives based on how active each VM currently is. While mClock constantly changes the physical resource allocation to each VM, it lets each VM hold onto the illusion that it has full control of all system resources. As a result, performance can be increased for VMs that need it, without letting the others know that “their” resources are being distributed to other machines.

==Research problem==
Today, modern hypervisors use a very primitive kind of IO resource allocation. Currently an algorithm called PARDA (Proportional Allocation of Resources in Distributed storage Access) [[#Foot1|1]] is used to allocate IO resources to each VM running on a particular storage device. Unfortunately, the hosts themselves use a fair scheduler called SFQ (Start-time Fair Queuing) [[#Foot2|2]]. This means that PARDA allocates IO resources to VMs in proportion to the number of IO shares on the host, but each host uses a fair scheduler that divides the IO shares amongst its VMs equally. As a result, whenever another VM is added, or another background application is run on one of the VMs, all the other VMs suffer a huge performance loss. In the case of adding another VM, there is a 40% performance drop. This is completely unacceptable when applications have minimum performance requirements to run effectively. An application with minimum resource requirements can be running fine on any given VM, but as soon as the load on the shared storage device increases, the application would run poorly, or could potentially crash.

==Contribution==
This paper addresses the current limitations of IO resource allocation for hypervisors by proposing a new and more efficient allocation algorithm. Older methods were limited to providing proportional shares. mClock incorporates proportional shares as well as a minimum reservation of IO resources and a maximum limit.

Older methods of IO resource allocation suffered severe performance loss. Whenever the load on the shared storage device increased, or another VM was added, the performance of all hosts would drop considerably. Older methods therefore provided unreliable IO management for hypervisors.

mClock was able to present VMs with a guaranteed minimum reservation of IO resources. This means that application performance will never drop below a certain point. This provides much better application stability on each of the VMs, and better overall performance.
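
As a rough illustration of how the three controls interact, the sketch below computes a steady-state allocation in which each VM receives capacity in proportion to its shares, clamped between its reservation and its limit. It is not the tag-based scheduler described in the paper; the function name <code>allocate</code>, the bisection search, and the IOPS figures are assumptions made purely for illustration.

```python
def allocate(capacity, vms, iterations=100):
    """Split `capacity` (e.g. IOPS) among VMs.

    `vms` maps a VM name to (reservation, limit, shares).
    Each VM receives shares * level, clamped to [reservation, limit],
    where `level` is found by bisection so the total matches capacity.
    """
    if sum(r for r, _, _ in vms.values()) > capacity:
        raise ValueError("reservations exceed capacity")
    # Never hand out more than the limits allow.
    target = min(capacity, sum(l for _, l, _ in vms.values()))

    def clamp(level, r, l, s):
        return min(max(level * s, r), l)

    lo, hi = 0.0, max(l / s for _, l, s in vms.values())
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if sum(clamp(mid, *v) for v in vms.values()) < target:
            lo = mid
        else:
            hi = mid
    return {name: clamp(hi, *v) for name, v in vms.items()}


# Hypothetical VMs: (reservation, limit, shares)
vms = {"A": (250, 1000, 1), "B": (0, 1000, 1), "C": (0, 100, 2)}
plenty = allocate(1000, vms)   # ample capacity
scarce = allocate(400, vms)    # contended capacity
```

With ample capacity, C is capped by its limit and A and B split the rest evenly (about 450 each). When capacity falls to 400, A still receives its reservation of 250 while B absorbs the shortfall, which is the kind of guarantee a share-only scheduler cannot give.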

"dmClock (used for cluster-based storage systems) runs a modified version of mClock at each server. There is only one modification to the algorithm to account for the distributed model in the Tag-Assignment component." - from the paper

==Critique==
The article introduces the mClock algorithm, which handles multiple VMs in a variable-throughput environment. The Quality of Service (QoS) requirements for a VM are expressed as a minimum reservation, a maximum limit, and a proportional share. mClock is able to meet those controls under varying capacity. A strength of the paper is that the algorithm proves efficient in clustered architectures; moreover, it provides greater isolation between VMs.

In this paper there were many terms that were used but never explained, such as orders (used in the graphs), LUN, PARDA, etc. Also, I did not like the way the calculations were written in sentences, "For a small reference IO size of 8KB and using typical values for mechanical delay Tm = 5ms and peak transfer rate, Bpeak = 60 MB/s, the numerator = Lat1*(1 + 8/300) ≈ Lat1". To me this was very messy and made me skip through the calculations part of the sentence.

==References==

1 A. Gulati, I. Ahmad, and C. Waldspurger. PARDA: Proportional Allocation of Resources in Distributed Storage Access. In Proceedings of the Seventh USENIX Conference on File and Storage Technologies (FAST ’09), Feb 2009.

2 W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In ACM SIGMETRICS, 2004. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.7012&rep=rep1&type=pdf Interposed proportional sharing for a storage service utility]

Talk:COMP 3000 Essay 2 2010 Question 10

2010-11-16T18:42:51Z

Naseido: /* Waiting for response */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

=Notes to Group=

=Group Members=
Please leave your name and email address if you are in the group

*[[User:Dagar|Daniel Agar]] - dagar@scs.carleton.ca
*[[User:xchen6|Xi Chen]] - xintai1985@gmail.com
*[[User:npatel1|Niravkumar Patel]] - npatel1@scs.carleton.ca
*[[User:tpham3|Tuan Pham]] - tpham3@scs.carleton.ca
*[[User:aellebla|Aaron Leblanc]] - aellebla@connect.carleton.ca
*[[User:naseido|Nisrin Abou-Seido]] - naseido@connect.carleton.ca

=Layout=
==Paper==
: the paper's title, authors, and their affiliations. Include a link to the paper and any particularly helpful supplementary information.
==Background Concepts==
: Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.
==Research problem==
: What is the research problem being addressed by the paper? How does this problem relate to past related work?
==Contribution==
: What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)
==Critique==
: What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.
==References==
: You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.

COMP 3000 Essay 2 2010 Question 10

2010-11-16T18:41:53Z

Naseido: /* Notes */

mClock: Handling Throughput Variability for Hypervisor IO Scheduling

see discussion.

According to scientists, the Sun is pretty big.<ref>E. Miller, The Sun, (New York: Academic Press, 2005), 23-5.</ref>
The Moon, however, is not so big.<ref>R. Smith, "Size of the Moon", Scientific American, 46 (April 1978): 44-6.</ref>

==Notes==

COMP 3000 Essay 2 2010 Question 10

2010-11-16T16:14:44Z

Naseido:

mClock: Handling Throughput Variability for Hypervisor IO Scheduling

see discussion.

Talk:COMP 3000 Essay 2 2010 Question 10

2010-11-16T15:48:01Z

Naseido: /* Group Members */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

=Notes to Group=

=Group Members=
Please leave your name and email address if you are in the group

*[[User:Dagar|Daniel Agar]] - dagar@scs.carleton.ca
*[[User:xchen6|Xi Chen]] - xintai1985@gmail.com
*[[User:npatel1|Niravkumar Patel]] - npatel1@scs.carleton.ca
*[[User:tpham3|Tuan Pham]] - tpham3@scs.carleton.ca
*[[User:aellebla|Aaron Leblanc]] - aellebla@connect.carleton.ca
*[[User:naseido|Nisrin Abou-Seido]] - naseido@connect.carleton.ca

===Waiting for response===
none


Talk:COMP 3000 Essay 2 2010 Question 10

2010-11-16T15:47:17Z

Naseido: /* Waiting for response */

'''mClock: Handling Throughput Variability for Hypervisor IO Scheduling'''

=Notes to Group=

=Group Members=
Please leave your name and email address if you are in the group

*[[User:Dagar|Daniel Agar]] - dagar@scs.carleton.ca
*[[User:xchen6|Xi Chen]] - xintai1985@gmail.com
*[[User:npatel1|Niravkumar Patel]] - npatel1@scs.carleton.ca
*[[User:tpham3|Tuan Pham]] - tpham3@scs.carleton.ca
*[[User:aellebla|Aaron Leblanc]] - aellebla@connect.carleton.ca

===Waiting for response===


COMP 3000 Essay 1 2010 Question 9

2010-10-15T11:44:29Z

Naseido: /* Introduction */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to tackle the problem of ever increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is not used.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
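
The malloc()/free() analogy can be made concrete with a toy allocator. The sketch below is only a model under stated assumptions: the class name <code>StoragePool</code>, its methods, and the representation of a DVA as a (device, offset) pair are illustrative and are not the real SPA interface.

```python
class StoragePool:
    """Toy pooled allocator: callers see only DVAs, never device layout."""

    def __init__(self):
        self.free = {}      # device id -> list of free block offsets
        self.next_dev = 0

    def add_device(self, num_blocks):
        """Grow the pool at run time by adding another device."""
        dev = self.next_dev
        self.next_dev += 1
        self.free[dev] = list(range(num_blocks))
        return dev

    def alloc(self):
        """Return a DVA, modelled here as a (device, offset) pair."""
        # Pick the device with the most free blocks to spread the load.
        dev = max(self.free, key=lambda d: len(self.free[d]), default=None)
        if dev is None or not self.free[dev]:
            raise MemoryError("pool is full")
        return (dev, self.free[dev].pop(0))

    def release(self, dva):
        """Give a block back to the pool, like free()."""
        dev, offset = dva
        self.free[dev].append(offset)


pool = StoragePool()
pool.add_device(2)
a, b = pool.alloc(), pool.alloc()   # the first device is now full
pool.add_device(4)                  # more space appears without repartitioning
c = pool.alloc()                    # served from the new device
pool.release(a)
```

The caller never names a partition or a disk; it only holds the addresses it was given, which is what lets devices be added to the pool dynamically.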

Virtual devices abstract device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via virtual devices, each of which implements a specific task. For example, if the SPA needed to handle mirroring, a virtual device would be written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This very large per-object limit allows for a correspondingly large storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easiest to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained in its parent's indirect block. Keeping the checksum and the data separate reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created and the old block's contents are copied into it, so the original is never overwritten in place. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
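
The interplay of parent-held checksums and ditto blocks can be sketched as follows. This is a simplified model, not ZFS code: the class <code>BlockPointer</code>, the list standing in for a disk, and the use of SHA-256 as the checksum function are all assumptions chosen for illustration.

```python
import hashlib


def checksum(data):
    # SHA-256 stands in for whichever checksum function is configured.
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Parent-side record: the checksum lives here, apart from the data."""

    def __init__(self, disk, data, copies=2):
        self.disk = disk                  # a plain list acting as the disk
        self.checksum = checksum(data)
        self.addresses = []
        for _ in range(copies):           # write each ditto copy
            self.addresses.append(len(disk))
            disk.append(data)

    def read(self):
        bad = []
        for addr in self.addresses:
            data = self.disk[addr]
            if checksum(data) == self.checksum:
                # Heal any corrupt copies seen before this healthy one.
                for bad_addr in bad:
                    self.disk[bad_addr] = data
                return data
            bad.append(addr)
        raise IOError("all copies are corrupt")


disk = []
ptr = BlockPointer(disk, b"important metadata")
disk[ptr.addresses[0]] = b"important metadxta"   # simulate silent corruption
recovered = ptr.read()                           # falls back to the second copy
```

Because the checksum is stored with the pointer rather than next to the data, a single bad write cannot damage both, and the read path can tell which copy is trustworthy.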

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
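
A minimal sketch of the copy-on-write idea, assuming a flat file system whose root is simply a tuple of blocks (the real on-disk tree is much deeper, and the class name <code>Filesystem</code> is hypothetical):

```python
class Filesystem:
    """Toy copy-on-write store: the root is an immutable tuple of blocks."""

    def __init__(self, blocks):
        self.root = tuple(blocks)

    def write(self, index, data):
        # Build the new version off to the side; the live root is untouched.
        new_blocks = list(self.root)
        new_blocks[index] = data
        new_root = tuple(new_blocks)
        # A single reference assignment stands in for the atomic root update.
        self.root = new_root


fs = Filesystem([b"aaa", b"bbb"])
old_root = fs.root        # holding on to the old root preserves the old version
fs.write(1, b"BBB")
```

At every instant <code>fs.root</code> refers to either the complete old version or the complete new one, never a mixture, which is why no journal replay is needed after a crash. Unmodified blocks are shared between the two versions rather than copied.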

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
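
Block-level deduplication with a hash table can be sketched in a few lines. The class <code>DedupStore</code>, the tiny 4-byte block size, and the choice of SHA-256 are assumptions for illustration only; they do not reflect how any particular file system lays out its tables.

```python
import hashlib


class DedupStore:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # digest -> block bytes, stored once
        self.files = {}    # file name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # store only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reconstruct the file by following its list of block digests.
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self):
        logical = sum(len(self.blocks[d])
                      for digests in self.files.values() for d in digests)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore()
store.write("a.txt", b"ABCDABCDWXYZ")
store.write("b.txt", b"ABCDWXYZ")
```

Here 20 logical bytes are held in two unique 4-byte blocks, so the fold factor is 2.5. Shrinking the block size would find more duplicates in realistic data but would also enlarge the hash table, which is the trade-off described above.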

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and more modest storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially, these sectors were meant to hold the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since each sector is small relative to large files, it would take significant time and memory to track every sector individually. To avoid this, the FAT file system groups sectors into clusters, and each cluster belongs to one file. A drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in it. The name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked list in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the file's second cluster, and so on until the entry for the file's last cluster. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.
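
The linked-list behaviour of the table can be sketched as follows. The dictionary-based table, the cluster numbers, and the <code>END_OF_CHAIN</code> marker are placeholders for illustration; a real FAT32 table is an on-disk array that uses reserved values to mark the end of a chain.

```python
END_OF_CHAIN = -1   # placeholder marker; real FAT32 uses reserved values


def cluster_chain(fat, first_cluster):
    """Follow the linked list stored in the FAT, starting from the
    first cluster number recorded in the directory entry."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


# fat[i] holds the number of the cluster that follows cluster i.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN,      # first file: sequential clusters
       5: 9, 9: 7, 7: END_OF_CHAIN}      # later file: fragmented

sequential = cluster_chain(fat, 2)
fragmented = cluster_chain(fat, 5)
```

Reading the second file means visiting clusters 5, 9 and 7 in that order, which on a spinning disk implies extra seeks compared with the sequential layout of the first file.
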
The number in the name of a FAT variant, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. A problem with larger clusters is that when files are much smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files end up with their clusters scattered across the storage device, which makes them slower to access. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a built-in function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and root directory, and the file system always keeps two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.
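
The arithmetic behind the 28-bit cluster numbers and the wasted space in large clusters can be worked through directly. The helper names below are hypothetical, and the figures are theoretical upper bounds that ignore reserved cluster values and any limits imposed by a particular operating system.

```python
KIB = 1024
TIB = 1024 ** 4


def max_volume_bytes(cluster_size, cluster_bits=28):
    """Upper bound on volume size: number of clusters times cluster size."""
    return (2 ** cluster_bits) * cluster_size


def slack_bytes(file_size, cluster_size):
    """Space allocated to a file but not used by its data."""
    clusters = -(-file_size // cluster_size)   # ceiling division
    return clusters * cluster_size - file_size


bound_32k = max_volume_bytes(32 * KIB) // TIB   # bound with 32 KB clusters, in TB
bound_64k = max_volume_bytes(64 * KIB) // TIB   # bound with 64 KB clusters, in TB
wasted = slack_bytes(1000, 32 * KIB)            # a 1000-byte file in one cluster
```

With 32 KB clusters the bound is 8 TB, and with 64 KB clusters it is 16 TB, but a 1000-byte file then wastes almost the entire cluster it occupies. Larger clusters raise the volume limit at the cost of more slack space.
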

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System); it attempts to mimic certain functionality of UFS while removing unnecessary parts. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly, which requires running “fsck” (the file system checker) to traverse the inodes and directories and repair any inconsistencies.
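
The superblock fields mentioned above are enough to locate things on disk. As a small sketch (function names are hypothetical, and the example values are made up), the per-group counts determine how many block groups exist and which group holds a given inode, given that ext2 inode numbers start at 1:

```python
def block_group_count(total_blocks, blocks_per_group):
    """Number of block groups needed to cover the volume."""
    return -(-total_blocks // blocks_per_group)   # ceiling division


def locate_inode(inode_number, inodes_per_group):
    """Return (block group, index within that group's inode table).

    Inode numbers start at 1, hence the subtraction.
    """
    group = (inode_number - 1) // inodes_per_group
    index = (inode_number - 1) % inodes_per_group
    return group, index


groups = block_group_count(total_blocks=100_000, blocks_per_group=8192)
where = locate_inode(8193, inodes_per_group=8192)
```

In this example the volume needs 13 block groups, and inode 8193 is the first entry in the inode table of group 1.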

==== Comparison ====
In general, when comparing how these file systems manage storage devices, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS allows 2^58 ZB, an amount which is incomparably larger. Also, ZFS can find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS takes over the role of a volume manager and can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like in the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance because the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
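
The idea of entering changes into a journal before applying them can be sketched generically. This is a textbook write-ahead log, not the NTFS on-disk format; the class <code>JournaledStore</code>, the record layout, and the <code>crash_before_apply</code> flag are all assumptions made for illustration.

```python
class JournaledStore:
    def __init__(self):
        self.journal = []   # records: ("write", key, value) or ("commit",)
        self.data = {}

    def update(self, changes, crash_before_apply=False):
        # 1. Describe the changes in the journal, then mark them committed.
        for key, value in changes.items():
            self.journal.append(("write", key, value))
        self.journal.append(("commit",))
        if crash_before_apply:
            return          # simulate power loss after the journal is written
        # 2. Only now touch the real structures, then retire the journal.
        self.data.update(changes)
        self.journal.clear()

    def recover(self):
        """Replay committed changes; drop writes that never committed."""
        pending = []
        for record in self.journal:
            if record[0] == "write":
                pending.append(record[1:])
            else:
                for key, value in pending:
                    self.data[key] = value
                pending = []
        self.journal.clear()


store = JournaledStore()
store.update({"a": 1})
store.update({"b": 2}, crash_before_apply=True)   # interrupted update
store.recover()                                   # "b" is restored from the journal
```

Recovery only has to read the journal rather than scan the whole volume, which is the main appeal of journaling over a full consistency check.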

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance because, in order to move compressed files, they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. A checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
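
Two points in this description lend themselves to a short worked sketch: how an extent maps logical file blocks to physical blocks, and where the 1EB and 16TB limits come from. The tuple layout of <code>Extent</code>, the helper names, and the 4 KB block size are assumptions for illustration; the on-disk ext4 extent structure is more involved.

```python
from collections import namedtuple

# One extent describes a run of contiguous physical blocks.
Extent = namedtuple("Extent", "logical_start length physical_start")


def logical_to_physical(extents, logical_block):
    """Translate a block offset within the file to a block on disk."""
    for ext in extents:
        if ext.logical_start <= logical_block < ext.logical_start + ext.length:
            return ext.physical_start + (logical_block - ext.logical_start)
    return None   # not mapped by any extent


def addressable_bytes(block_bits, block_size=4096):
    """Volume size reachable with block numbers of the given width."""
    return (2 ** block_bits) * block_size


# A 1024-block file described by just two extents.
extents = [Extent(0, 1000, 5000), Extent(1000, 24, 9000)]
physical = logical_to_physical(extents, 1003)

ext4_limit = addressable_bytes(48)   # 2**60 bytes, i.e. 1EB
ext3_limit = addressable_bytes(32)   # 2**44 bytes, i.e. 16TB
```

Two small descriptors replace what would otherwise be 1024 individual block pointers, which is where the reduced overhead for large files comes from.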

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows a single file of up to 16EB, which alone is 16 times the maximum ext4 volume size, and its maximum pool size is far larger still. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems can do. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is well suited to modern computing needs.

====Btrfs====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks. The first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. The items contain a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few other fields.
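
As a rough illustration of the key layout described above (64 + 8 + 64 = 136 bits), the following sketch packs the three fields into one integer and shows that sorting by field order and sorting by the packed value agree. The field name <code>offset</code> and the type numbers are placeholders chosen for this example, not values taken from the Btrfs source.

```python
from typing import NamedTuple


class Key(NamedTuple):
    object_id: int  # 64 bits
    item_type: int  # 8 bits
    offset: int     # 64 bits; meaning depends on item_type (name is an assumption)


def pack(key: Key) -> int:
    """Pack the three fields into a single 136-bit integer."""
    assert 0 <= key.object_id < 2**64
    assert 0 <= key.item_type < 2**8
    assert 0 <= key.offset < 2**64
    return (key.object_id << 72) | (key.item_type << 64) | key.offset


# Type numbers below are arbitrary placeholders.
keys = [Key(257, 12, 0), Key(256, 108, 4096), Key(256, 1, 0), Key(256, 108, 0)]

# Tuple ordering compares object id first, then type, then the last 64 bits.
sorted_keys = sorted(keys)
```

Because every field has a fixed width, comparing packed integers gives the same order as comparing the fields one by one, which is what lets all items for one object sit next to each other in the tree.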

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree that holds information about all the allocated extents. Outside the trees are the extents and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur on the same blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage), although now a defunct project as it was cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS was attempting a complete rethink of the file system.

The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.

Unlike traditional file systems, including the cutting-edge ZFS, WinFS was moving away from relying on the hierarchical structure of the file system for search. The file data was still going to reside in a hierarchical structure based on NTFS, but metadata would be stored in a relational database. This would allow searches that transcended normal file metadata such as time stamp, name and text content. Metadata could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, email traffic and contact cards from organizer software.

Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.

====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.

WinFS provided an interesting set of features, many of which are completely different than ZFS, because it approached the problem differently. While both were attempting to meet the needs of the modern computer, ZFS was more focused on server settings and WinFS seemed to be focused on the requirements of the home PC user. WinFS was not focused on performance or large memory storage needs and therefore it would not have been able to serve as a useful replacement for traditional file systems.

These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so that it meets a wider needs profile. However, if we are discussing a viable alternative to file systems today, ZFS is currently the best choice.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) setup. The growing need for additional and larger disks meant that administrators were continuously switching disks, partitioning them, and copying old data to the new, larger disks, which caused a significant amount of user downtime. The administrators were therefore attracted to the storage pool concept and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed via a smartphone or a server. Time wasted on partitioning, disk swapping and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

* Rector Brent, (2004),[http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting

COMP 3000 Essay 1 2010 Question 9

2010-10-15T11:42:46Z

Naseido: /* Comparison */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
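
The malloc()/free() analogy can be sketched as a toy pool. This is only an illustration under simplifying assumptions: the class and method names are invented here, every block has the same size, and a virtual address is just a counter rather than a real DVA.

```python
class StoragePool:
    """Toy pool: hands out fixed-size blocks from whichever device has room."""

    def __init__(self):
        self.free_blocks = []  # list of (device, block_number) pairs
        self.allocated = {}    # virtual address -> (device, block_number)
        self.next_dva = 0

    def add_device(self, name: str, num_blocks: int) -> None:
        """Grow the pool; existing virtual addresses are unaffected."""
        self.free_blocks.extend((name, n) for n in range(num_blocks))

    def alloc(self) -> int:
        """Like malloc(): return a virtual address for one block."""
        if not self.free_blocks:
            raise MemoryError("pool exhausted")
        location = self.free_blocks.pop(0)
        dva = self.next_dva
        self.next_dva += 1
        self.allocated[dva] = location
        return dva

    def free(self, dva: int) -> None:
        """Like free(): return the block to the pool."""
        self.free_blocks.append(self.allocated.pop(dva))


pool = StoragePool()
pool.add_device("disk0", 2)
a, b = pool.alloc(), pool.alloc()
pool.add_device("disk1", 2)  # storage added while the pool is in use
c = pool.alloc()
pool.free(a)
```

The caller only ever sees the virtual addresses, so the third allocation lands on the newly added device without any partitioning step.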

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.
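
The node-with-children view of virtual devices might look like the following sketch. The <code>Disk</code> and <code>Mirror</code> classes are hypothetical stand-ins written for this example; they share a read/write interface so that a mirror can contain disks or other virtual devices.

```python
class Disk:
    """Leaf node: stands in for a device driver."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks[addr]  # raises KeyError if the block is missing


class Mirror:
    """Interior virtual device: writes go to every child, reads to the first that answers."""

    def __init__(self, *children):
        self.children = children

    def write(self, addr, data):
        for child in self.children:
            child.write(addr, data)

    def read(self, addr):
        for child in self.children:
            try:
                return child.read(addr)
            except KeyError:
                continue
        raise KeyError(addr)


d0, d1 = Disk(), Disk()
mirror = Mirror(d0, d1)
mirror.write(7, b"hello")
del d0.blocks[7]           # simulate losing the copy on one disk
recovered = mirror.read(7)
```

Because both classes expose the same interface, new behaviour can be added by writing another node type rather than by changing the callers.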

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information, which allows for a much larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]
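
A minimal sketch of these two ideas together, the checksum kept in the parent and copy-on-write up to the root, is given below. It assumes a two-level tree (one indirect block above the data blocks), uses SHA-256 and JSON purely for convenience, and all function names are invented for this example; here the "uberblock" is simply a pointer to the root indirect block.

```python
import hashlib
import json

disk = {}  # address -> bytes; stands in for physical storage


def write_block(data: bytes):
    """Write data to a fresh address and return an (address, checksum) pointer."""
    addr = len(disk)  # blocks are never overwritten, so this is always unused
    disk[addr] = data
    return (addr, hashlib.sha256(data).hexdigest())


def read_block(pointer) -> bytes:
    """Read a block and verify it against the checksum held by its parent."""
    addr, checksum = pointer
    data = disk[addr]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError(f"checksum mismatch at block {addr}")
    return data


def write_indirect(pointers):
    """An indirect block is just a serialized list of child pointers."""
    return write_block(json.dumps(pointers).encode())


def read_indirect(pointer):
    return [tuple(p) for p in json.loads(read_block(pointer))]


def cow_update(uberblock, index, new_data):
    """Return a new root pointer; nothing reachable from the old one is overwritten."""
    pointers = read_indirect(uberblock)
    pointers[index] = write_block(new_data)  # new leaf at a new address
    return write_indirect(pointers)          # new parent holding the new checksum


leaves = [write_block(b"alpha"), write_block(b"beta")]
uber_v1 = write_indirect(leaves)
uber_v2 = cow_update(uber_v1, 1, b"gamma")
```

After the update, <code>uber_v1</code> still describes the old, consistent tree and <code>uber_v2</code> the new one, which is also why snapshots are cheap in this model.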

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
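
The read path described in the last two paragraphs could be sketched as follows. This is an illustration only: CRC32 is used as a stand-in checksum, the block pointer is modelled as a list of addresses plus one expected checksum, and <code>read_with_healing</code> is a hypothetical helper that repairs only the bad copies it meets before finding a healthy one.

```python
import zlib


def checksum(data: bytes) -> int:
    return zlib.crc32(data)  # stand-in for the real checksum function


def read_with_healing(disk: dict, addresses: list, expected: int) -> bytes:
    """Try each copy in turn; repair bad copies from the first healthy one."""
    bad = []
    for addr in addresses:
        data = disk.get(addr)
        if data is not None and checksum(data) == expected:
            for bad_addr in bad:
                disk[bad_addr] = data  # overwrite damaged copies
            return data
        bad.append(addr)
    raise IOError("all copies failed verification")


payload = b"metadata"
disk = {10: payload, 20: payload}
pointer = ([10, 20], checksum(payload))  # one block pointer, two copies
disk[10] = b"metadatA"                   # silent corruption of the first copy
result = read_with_healing(disk, *pointer)
```

The caller receives the healthy data and the damaged copy is rewritten as a side effect of the read.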

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
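
A block-level scheme of this kind can be sketched with a hash table keyed by block digest. The <code>DedupStore</code> class is invented for this example, the 4-byte block size is chosen only to keep the data readable, and the fold factor is computed here as the ratio of logical to physical bytes, which is one reasonable reading of the definition above.

```python
import hashlib


class DedupStore:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, stored once
        self.files = {}   # file name -> list of block digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        """Reconstruct a file by following its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")
```

The two files share two of their three blocks, so 24 logical bytes occupy 16 physical bytes. With whole-file granularity the same data would not deduplicate at all, which is the trade-off described above.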

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and smaller storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since each sector is small relative to large files, it would take significant time and memory to document every sector along with the file it belongs to and where it is located. To avoid this overhead, the FAT file system uses clusters, which are fixed-size groupings of sectors; each cluster belongs to at most one file. A drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster and no other file can use the unused sectors in it.

The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster in the file, and so on until the last cluster entry for that file. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.

The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are available. The problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters become available for new data. Because of this, a file may have its clusters scattered across the storage device, and accessing it takes longer. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a default function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps two copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to recover the files.
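
The cluster chain described above can be followed with a few lines of code. In this sketch the table is a plain dictionary, <code>END_OF_CHAIN</code> is a placeholder marker rather than the reserved value a real FAT32 volume uses, and the file names and cluster numbers are made up.

```python
END_OF_CHAIN = -1  # placeholder marker; real FAT32 uses reserved entry values


def cluster_chain(fat, first_cluster):
    """Follow the table from a file's first cluster to its last."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


# fat[i] holds the number of the cluster that follows cluster i.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN,  # first file: sequential clusters
       5: 9, 9: 6, 6: END_OF_CHAIN}  # later file: fragmented

# The directory records only the first cluster of each file.
directory = {"first.txt": 2, "later.txt": 5}

max_clusters = 2**28  # 28 of the 32 bits number the clusters
```

Reading <code>later.txt</code> means visiting clusters 5, 9 and 6 in that order, which illustrates why a fragmented file is slower to read than one whose clusters are adjacent.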

==== Ext2 ====
The ext2 file system (second extended file system) was designed after UFS (Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table was used to define how file fragments were organized, and it was important to have duplicate copies of this FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy gets damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of "fsck" (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. FAT32 and ext2 each manage just a single storage device, whereas ZFS uses pooled storage that can span multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system. This assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
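
The journaling idea, logging a change before applying it so that an interrupted write can be completed later, can be sketched generically. This is not the NTFS log format; the <code>JournaledStore</code> class and its <code>crash_before_apply</code> flag are invented here purely to simulate an interruption.

```python
class JournaledStore:
    def __init__(self):
        self.journal = []  # pending (key, value) records, written first
        self.data = {}     # the "real" on-disk state

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))  # 1. log the intended change
        if crash_before_apply:
            return                         # simulated interruption
        self.data[key] = value             # 2. apply it
        self.journal.remove((key, value))  # 3. retire the log record

    def recover(self):
        """Replay any logged changes that were never applied."""
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()


store = JournaledStore()
store.write("a.txt", "v1")
store.write("b.txt", "v1", crash_before_apply=True)
before = dict(store.data)  # state right after the simulated crash
store.recover()
```

Recovery only has to read the journal, not scan the whole volume, which is the property the paragraph above attributes to journaling.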

Another advantage is that NTFS allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files than ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than ext4. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a motivation similar to that of ZFS, Btrfs provides a similarly rich feature set that is well suited to modern computing needs.

====Btrfs====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though a lot of the implementation is different. Development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks. The first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. The items contain a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
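
The 136-bit key layout described above can be illustrated with a short Python sketch. This is not the Btrfs on-disk code; the class name <code>BtrfsKey</code>, the field name <code>offset</code> for the type-dependent 64 bits, and the helper functions are assumptions made for illustration. It shows how the three chunks combine into one value and why sorting on the packed value matches sorting on (object id, type, last field).

```python
from typing import NamedTuple


class BtrfsKey(NamedTuple):
    """Illustrative 136-bit key: 64-bit object id, 8-bit type, 64-bit type-dependent field."""
    objectid: int
    item_type: int
    offset: int  # hypothetical name for the last, type-dependent 64 bits

    def pack(self) -> int:
        """Pack the three chunks into one integer, object id in the highest bits."""
        assert 0 <= self.objectid < 2**64
        assert 0 <= self.item_type < 2**8
        assert 0 <= self.offset < 2**64
        return (self.objectid << 72) | (self.item_type << 64) | self.offset


def unpack(value: int) -> BtrfsKey:
    """Split a packed 136-bit integer back into its three chunks."""
    return BtrfsKey(value >> 72, (value >> 64) & 0xFF, value & (2**64 - 1))
```

Because the object id occupies the most significant bits, all items belonging to one object sort next to each other, ordered by type and then by the final field.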

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

As a minimum the Btrfs file system contains three trees. First, it contains a tree that contains other tree roots. Second, it contains a subvolumes tree which holds files and directories and third it contains an extents volume tree that contains information about all the allocated extents files. Outside of the trees will be the extents, and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

Copy-on-write is pivotal to a number of features of Btrfs. Writes never occur on the same blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage), although now defunct as the project was cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with an almost identical feature set to ZFS, WinFS attempted a complete rethink of the file system.

The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.

Unlike traditional file systems, including the cutting-edge ZFS, WinFS was moving away from a hierarchical structure for search. The file data was still going to reside in a hierarchical structure based on NTFS, but metadata would be stored in a relational database. This would allow searches that transcended normal file metadata such as time stamp, name and text content. Metadata could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, email traffic and contact cards from organizer software.

Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.

====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.

WinFS provided an interesting set of features, many of which are completely different than ZFS, because it approached the problem differently. While both were attempting to meet the needs of the modern computer, ZFS was more focused on server settings and WinFS seemed to be focused on the requirements of the home PC user. WinFS was not focused on performance or large memory storage needs and therefore it would not have been able to serve as a useful replacement for traditional file systems.

These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so it meets a wider needs profile. However, if we are discussing a viable alternative to file systems today, ZFS is currently the best choice.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) setup. Given the growing need for additional and larger disks, administrators were continuously switching disks, partitioning them and copying old data to the new larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed via a smartphone or a server. Time wasted on partitioning, disk swapping and similar activities, and any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

* Rector Brent, (2004),[http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting

COMP 3000 Essay 1 2010 Question 9

2010-10-15T11:40:11Z

Naseido: /* Comparison */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage is allocated and freed instead of memory. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
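
The malloc()/free() analogy can be made concrete with a toy Python sketch. This is not the SPA interface; the class <code>StoragePool</code>, its method names, and the representation of a DVA as a (device, block index) pair are all assumptions chosen for illustration. The point is only that callers see opaque addresses, so devices can be added to the pool without callers changing.

```python
class StoragePool:
    """Toy pooled allocator: callers only ever see opaque block addresses."""

    def __init__(self):
        self.free_blocks = []   # available addresses, here (device, block_index) pairs
        self.allocated = set()  # addresses currently handed out

    def add_device(self, device_name, num_blocks):
        """Grow the pool dynamically; existing callers are unaffected."""
        self.free_blocks.extend((device_name, i) for i in range(num_blocks))

    def alloc(self):
        """Hand out one free block, analogous to malloc()."""
        if not self.free_blocks:
            raise RuntimeError("storage pool exhausted")
        dva = self.free_blocks.pop(0)
        self.allocated.add(dva)
        return dva

    def free(self, dva):
        """Return a block to the pool, analogous to free()."""
        self.allocated.remove(dva)
        self.free_blocks.append(dva)
```

When the pool runs out, adding a device makes new blocks available to every consumer of the pool, which is the contrast with per-file-system partitions drawn above.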

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This large per-object capacity allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS' terminology, is called the uberblock. Each block has a checksum which is maintained by the block parent's indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]
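
The copy-on-write update that propagates up to the root can be sketched in Python as follows. This is a simplified model, not ZFS code: the <code>Node</code> class and <code>cow_update</code> function are hypothetical, SHA-256 is an arbitrary stand-in checksum, and each parent's checksum here simply covers its children's checksums to mimic the idea that a block's checksum is kept by its parent.

```python
import hashlib


class Node:
    """Immutable tree node: leaves hold data, interior nodes hold children."""

    def __init__(self, data=b"", children=()):
        self.data = data
        self.children = tuple(children)
        h = hashlib.sha256(data)
        for child in self.children:
            h.update(child.checksum)  # parent checksum depends on every child
        self.checksum = h.digest()


def cow_update(root, path, new_data):
    """Return a new root; blocks on the path are rewritten, all others are shared."""
    if not path:
        return Node(new_data, root.children)
    index = path[0]
    children = list(root.children)
    children[index] = cow_update(children[index], path[1:], new_data)
    return Node(root.data, children)
```

The old root remains a complete, consistent view of the earlier state, and only switching to the new root makes the change visible.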

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
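
A minimal Python sketch of this read path is given below. It is an illustration under stated assumptions, not the ZFS implementation: the <code>BlockPointer</code> class, the dictionary standing in for a disk, the function names, and the choice of SHA-256 as the checksum are all hypothetical.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is an arbitrary stand-in for the file system's checksum function.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: several copy addresses plus one stored checksum."""

    def __init__(self, addresses, stored_checksum):
        self.addresses = addresses
        self.stored_checksum = stored_checksum


def write_block(disk: dict, addresses, data: bytes) -> BlockPointer:
    """Write the same data to every address and record its checksum in the pointer."""
    for addr in addresses:
        disk[addr] = data
    return BlockPointer(list(addresses), checksum(data))


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    """Return the first copy whose checksum matches, repairing bad copies seen so far."""
    bad = []
    for addr in ptr.addresses:
        data = disk.get(addr)
        if data is not None and checksum(data) == ptr.stored_checksum:
            for bad_addr in bad:
                disk[bad_addr] = data  # heal the corrupted copies
            return data
        bad.append(addr)
    raise IOError("all copies failed checksum verification")
```

Keeping the checksum in the pointer rather than next to the data means a single corrupted block cannot also corrupt the value it is checked against.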

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
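
Block-level deduplication with a hash table can be sketched in a few lines of Python. This is a toy model rather than any real system's design: the <code>DedupStore</code> class and its methods are hypothetical, SHA-256 is an assumed hash function, and the fold factor is computed here as the ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store backed by a hash table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block data, each unique block stored once
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data: bytes):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def read(self, name) -> bytes:
        """Reconstruct a file by following its logical links to the stored blocks."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical
```

A smaller <code>block_size</code> tends to find more duplicates but adds one hash table entry per block, which is the trade-off described above.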

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
A storage device's space is made up of sectors (usually 512 bytes each). Initially, the plan was for these sectors to hold the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors contain that file's data. Since each sector is small compared to larger files, it would take a significant amount of time and memory to record, for every sector, the file it belongs to and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors, and each cluster belongs to one file. One issue with clusters is that when a file is smaller than a cluster, it still occupies the whole cluster and no other file can use the unused sectors in that cluster. The name FAT stands for File Allocation Table, which is the table that contains an entry for each cluster in the storage device along with its properties. The FAT is designed as a linked list data structure in which each node holds a cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The table entry for that first cluster contains the number of the second cluster in the file, and this continues until the last cluster entry for the file. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.
The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. An issue with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered across the storage device, and such files take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access a file. Since this is not a default function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always contains 2 copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to recover the files.
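
The linked-list lookup through the file allocation table can be sketched in Python. This is a simplified model, not the FAT32 on-disk format: the table is a plain dictionary, <code>END_OF_CHAIN</code> is an illustrative marker rather than the real end-of-chain value, and the function names are hypothetical.

```python
END_OF_CHAIN = -1  # illustrative marker, not the real FAT32 end-of-chain value


def cluster_chain(fat, first_cluster):
    """Follow table entries from a file's first cluster to its last one."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("corrupt table: cluster chain loops")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # each entry holds the number of the next cluster
    return chain


def read_file(fat, clusters, first_cluster, size):
    """Concatenate the clusters in chain order and trim to the recorded file size."""
    data = b"".join(clusters[c] for c in cluster_chain(fat, first_cluster))
    return data[:size]
```

For a fragmented file the chain visits clusters out of numerical order, for example 2, then 5, then 3, which is why scattered clusters slow down access on a physical disk.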

==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defined how file fragments were organized, and it was important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS allows 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources, since it does not need to go systematically through a storage device. FAT32 and ext2 each manage just a single storage device, whereas ZFS builds volume management into the file system and can control multiple storage devices, virtual or physical. It is easy to see that, although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error in the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of an interruption during writing.
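The idea of entering changes into a journal before making them can be sketched in a few lines. This is a generic write-ahead illustration under simplifying assumptions, not the NTFS log format: `JournaledStore`, its entry layout and the `crash_before_apply` flag are all hypothetical.

```python
class JournaledStore:
    """Toy key/value store that journals every change before applying it."""

    def __init__(self):
        self.journal = []  # ("intent", txid, key, value) or ("commit", txid)
        self.data = {}
        self.next_txid = 0

    def write(self, key, value, crash_before_apply=False):
        txid = self.next_txid
        self.next_txid += 1
        # 1. Record the intended change and mark it committed in the journal.
        self.journal.append(("intent", txid, key, value))
        self.journal.append(("commit", txid))
        if crash_before_apply:
            return txid  # simulate an interruption before the real update
        # 2. Only then apply the change to the main data.
        self.data[key] = value
        return txid

    def recover(self):
        """Replay committed journal entries after an interruption."""
        committed = {entry[1] for entry in self.journal if entry[0] == "commit"}
        for entry in self.journal:
            if entry[0] == "intent" and entry[1] in committed:
                _, _, key, value = entry
                self.data[key] = value


store = JournaledStore()
store.write("a", 1)
store.write("b", 2, crash_before_apply=True)  # change is journaled but not applied
store.recover()                               # replay brings "b" back
print(store.data)
```

Because the journal is written first, an interrupted update can be completed from the journal instead of scanning the whole volume.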

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
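The 16TB and 1EB limits follow from the width of the block number once a block size is fixed. The sketch below assumes a 4 KiB block size, which the text does not state, and uses binary units (TiB, EiB); `max_volume_bytes` is a hypothetical helper.

```python
BLOCK_SIZE = 4 * 1024  # assumed 4 KiB block size; not stated in the text

TIB = 2 ** 40
EIB = 2 ** 60


def max_volume_bytes(block_number_bits: int, block_size: int = BLOCK_SIZE) -> int:
    """Largest volume addressable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size


ext3_limit = max_volume_bytes(32)  # 2^32 blocks * 2^12 bytes = 2^44 bytes
ext4_limit = max_volume_bytes(48)  # 2^48 blocks * 2^12 bytes = 2^60 bytes

print(ext3_limit // TIB, "TiB")
print(ext4_limit // EIB, "EiB")
```

Under that assumption, the extra 16 bits in the block number multiply the addressable volume size by 2^16.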

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume size of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system size of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====Btrfs====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though a lot of the implementation is different. Its development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: a 64-bit object id, followed by 8 bits denoting the item’s type, and finally 64 bits whose use depends on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. The items contain a key and information on the size and location of the item’s data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.
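The 136-bit key layout (64-bit object id, 8-bit type, 64-bit type-dependent field) can be sketched as follows. The field name `offset`, the helper names and the sample type numbers are illustrative assumptions rather than actual Btrfs definitions.

```python
from typing import NamedTuple


class Key(NamedTuple):
    object_id: int  # 64 bits
    item_type: int  # 8 bits
    offset: int     # 64 bits; meaning depends on item_type


def pack(key: Key) -> int:
    """Pack the three fields into one 136-bit integer (64 + 8 + 64 bits)."""
    if not (0 <= key.object_id < 2 ** 64
            and 0 <= key.item_type < 2 ** 8
            and 0 <= key.offset < 2 ** 64):
        raise ValueError("field out of range")
    return (key.object_id << 72) | (key.item_type << 64) | key.offset


def unpack(value: int) -> Key:
    """Split a packed 136-bit integer back into its three fields."""
    return Key(value >> 72, (value >> 64) & 0xFF, value & (2 ** 64 - 1))


# Sorting on the packed value orders items by object id, then type, then offset.
keys = [Key(257, 2, 0), Key(256, 9, 4096), Key(256, 1, 0)]
ordered = sorted(keys, key=pack)
print(ordered)
```

Because the object id occupies the most significant bits, all items belonging to one object sort next to each other in the tree.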

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

As a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree that holds information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage), although now defunct since its cancellation in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.

The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.

Unlike traditional file systems, including the cutting-edge ZFS, WinFS moved away from the hierarchical structure of file systems where search was concerned. The file data was still going to reside in a hierarchical structure based on NTFS, but the meta-data was stored in a relational database. This would allow searches that transcended normal file meta-data like time stamp, name and text content. Meta-data could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, the email traffic and the contact cards from organizer software.

Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.

====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.

WinFS provided an interesting set of features, many of which are completely different from those of ZFS, because it approached the problem differently. While both were attempting to meet the needs of the modern computer, ZFS was more focused on server settings and WinFS seemed to be focused on the requirements of the home PC user. WinFS was not focused on performance or large storage needs, and therefore it would not have been able to serve as a useful replacement for traditional file systems.

These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so that it meets a wider needs profile. However, if we are discussing a viable alternative to file systems today, ZFS is currently the best choice.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) installation. Given the growing need for additional and larger disks, administrators were continuously switching disks, creating partitions and copying old data to the new, larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed from a smartphone or a server. Time wasted on partitioning, disk swapping and similar activities, or any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

* Rector Brent, (2004),[http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting

COMP 3000 Essay 1 2010 Question 9

2010-10-15T11:38:38Z

Naseido: /* Comparison */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to meet ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage: since a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS left intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
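The malloc()/free() analogy can be made concrete with a toy allocator. This is a minimal sketch, not the SPA interface: `StoragePool`, its method names and the representation of an address as a `(device, block)` pair are assumptions made for illustration.

```python
from collections import deque


class StoragePool:
    """Toy pool that hands out block addresses without exposing device layout."""

    def __init__(self):
        self._free = deque()     # addresses available for allocation
        self._allocated = set()  # addresses currently in use

    def add_device(self, name, num_blocks):
        """Grow the pool dynamically; callers of alloc() are unaffected."""
        self._free.extend((name, block) for block in range(num_blocks))

    def alloc(self):
        """Return an opaque address for one block, like malloc() for storage."""
        if not self._free:
            raise MemoryError("storage pool exhausted")
        address = self._free.popleft()
        self._allocated.add(address)
        return address

    def free(self, address):
        """Return a block to the pool, like free()."""
        self._allocated.remove(address)
        self._free.append(address)

    def free_count(self):
        return len(self._free)


pool = StoragePool()
pool.add_device("disk0", 2)
a = pool.alloc()
b = pool.alloc()
pool.add_device("disk1", 2)  # storage added while the pool is in use
c = pool.alloc()
pool.free(a)
print(a, b, c, pool.free_count())
```

The caller only ever sees addresses returned by `alloc()`, so a device can be added without any file system being repartitioned.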

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This is a very large amount of information, which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase the flexibility of ZFS and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained in the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated and the modified contents are written there, leaving the old block untouched. Any pointers and/or indirect blocks are then updated, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
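Verifying a checksum on read and falling back to a duplicate copy can be sketched as follows. This is an illustration under assumptions: the disk is modelled as a Python dictionary, SHA-256 from the standard library stands in for whatever checksum the file system is configured to use, and `BlockPointer`, `write_block` and `read_block` are hypothetical names.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Stores the expected checksum apart from the data, plus every copy's address."""

    def __init__(self, addresses, expected):
        self.addresses = addresses
        self.expected = expected


def write_block(disk: dict, addresses, data: bytes) -> BlockPointer:
    for address in addresses:  # write one copy per address
        disk[address] = data
    return BlockPointer(list(addresses), checksum(data))


def read_block(disk: dict, pointer: BlockPointer) -> bytes:
    for address in pointer.addresses:
        data = disk.get(address)
        if data is not None and checksum(data) == pointer.expected:
            # A healthy copy was found: rewrite any copy that differs from it.
            for other in pointer.addresses:
                if disk.get(other) != data:
                    disk[other] = data
            return data
    raise OSError("all copies failed checksum verification")


disk = {}
pointer = write_block(disk, [10, 20], b"hello")
disk[10] = b"hellx"                # simulate silent corruption of one copy
print(read_block(disk, pointer))   # served from the healthy copy
print(disk[10])                    # the damaged copy has been repaired
```

Keeping the checksum in the pointer rather than next to the data is what lets the reader tell which copy is the damaged one.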

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
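The "write detached, then connect in one atomic step" model can be sketched with a toy block store. It is a simplification under assumptions: there is a single root block that maps names to data blocks, and `CowTree` and its methods are hypothetical names, not ZFS structures.

```python
class CowTree:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}   # block id -> contents, written once
        self.next_id = 0
        self.root = self._write({})  # root block maps name -> data block id

    def _write(self, contents):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def update(self, name, data):
        # 1. Write the new data and a new root in a detached state.
        data_id = self._write(data)
        new_root_contents = dict(self.blocks[self.root])
        new_root_contents[name] = data_id
        new_root = self._write(new_root_contents)
        # 2. Connect the new structures with a single pointer switch.
        old_root = self.root
        self.root = new_root
        return old_root  # the old root still describes the previous state

    def read(self, name, root=None):
        root = self.root if root is None else root
        return self.blocks[self.blocks[root][name]]


tree = CowTree()
tree.update("file", b"v1")
previous = tree.update("file", b"v2")
print(tree.read("file"))            # current contents
print(tree.read("file", previous))  # contents as of the earlier root
```

A crash before step 2 leaves the old root in place and a crash after it leaves the new one, so the structure is consistent either way; keeping an old root reachable gives a point-in-time view of the data.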

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
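A block-level scheme built on a hash table can be sketched as follows. This is a minimal illustration, not how ZFS implements deduplication: `DedupStore` is a hypothetical class, the 4-byte block size is chosen only to keep the example small, and the fold factor is computed here as the ratio of logical to physical bytes, which is one reading of the definition above.

```python
import hashlib


class DedupStore:
    """Stores each unique block once and links files to blocks by digest."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (physical storage)
        self.files = {}   # file name -> list of digests (logical view)

    def put(self, name, data: bytes):
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def get(self, name) -> bytes:
        """Reconstruct a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d])
                      for digests in self.files.values() for d in digests)
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBB")
store.put("b", b"BBBBCCCC")  # shares the block b"BBBB" with file "a"
print(len(store.blocks), store.fold_factor())
```

Shrinking `block_size` tends to find more shared blocks but adds one hash table entry and one hash computation per block, which is the trade-off described above.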

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and smaller storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially it was planned that these sectors would contain the data of a file, and that larger files would be stored across multiple sectors. To retrieve a file, the system needs a record of which sectors contain the data of the requested file. Since each sector is small relative to large files, it would take significant amounts of time and memory to record, for every sector, the file it is associated with and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors, and each cluster belongs to one file. A known issue with clusters is that when a file is smaller than a cluster, the file still takes up the whole cluster and no other file can use the unused sectors in it. In FAT32, the name FAT stands for File Allocation Table, which is the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in the file. This continues until the last cluster entry for the file. The first file on a new device will use sequential clusters. Therefore, the first cluster will point to the second, which will point to the third, and so on.
The number in the name of a FAT variant, such as FAT32, indicates that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. A drawback of larger clusters is that when files are much smaller than the cluster size, a large part of each cluster is wasted. When a file is accessed, the file system must find all of the clusters that make up the file, and this takes a long time if the clusters are not laid out in order. When files are deleted, their clusters are marked as free and become available for new data. As a result, the clusters of some files end up scattered across the storage device, and accessing those files takes longer. FAT32 does not include a defragmentation mechanism, but all recent Windows operating systems ship with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. In addition, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the beginning of every FAT32 file system contains information about the operating system and the root directory, and the file system always holds two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.
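The linked-list layout of the file allocation table, and the linear search for free space, can be sketched with a Python list standing in for the table. The markers `FREE` and `END_OF_CHAIN`, the function names and the 8-entry table are illustrative assumptions; real FAT32 entries use reserved numeric values instead.

```python
FREE = None         # hypothetical marker for an unused table entry
END_OF_CHAIN = -1   # hypothetical marker for the last cluster of a file


def cluster_chain(fat, first_cluster):
    """Follow the linked list in the table, starting at a file's first cluster."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]  # each entry holds the number of the next cluster
    return chain


def first_free_cluster(fat):
    """Linear search for an unused cluster; returns None if the table is full."""
    for number, entry in enumerate(fat):
        if entry is FREE:
            return number
    return None


# Toy table with 8 clusters: one fragmented file occupies clusters 2 -> 3 -> 6.
fat = [FREE] * 8
fat[2], fat[3], fat[6] = 3, 6, END_OF_CHAIN

print(cluster_chain(fat, 2))
print(first_free_cluster(fat))
```

The directory entry only needs to store the first cluster number, since every later cluster is found by walking the table.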

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and keeps certain UFS functionality while removing what was considered unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar reasons, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.
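
As a rough illustration of how the superblock fields listed above are used, the following Python sketch derives the number of block groups and finds the group that holds a given inode. The class and the example numbers are hypothetical, and the group count is simplified: it ignores the reserved blocks before the first block group.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Only the superblock fields named in the text; not the real on-disk layout."""
    block_size: int
    total_blocks: int
    blocks_per_group: int
    total_inodes: int
    inodes_per_group: int

    def group_count(self):
        # Round up, because the last block group may be only partly filled.
        return -(-self.total_blocks // self.blocks_per_group)


def group_of_inode(sb, inode_number):
    """ext2 inode numbers start at 1, so subtract 1 before dividing."""
    return (inode_number - 1) // sb.inodes_per_group


# Made-up example volume.
sb = Superblock(block_size=1024, total_blocks=20000, blocks_per_group=8192,
                total_inodes=5016, inodes_per_group=1672)

print(sb.group_count())          # 3
print(group_of_inode(sb, 1))     # 0
print(group_of_inode(sb, 1673))  # 1
```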

==== Comparison ====
In general, when comparing how these file systems manage storage devices, FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies saves ZFS time and resources, because it does not need to go systematically through a storage device. FAT32 and ext2 each just manage a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. Although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.
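
The 8TB and 16TB figures quoted for FAT32 are consistent with multiplying the 2^28 available clusters by the cluster size. The short check below assumes binary units (1KB = 2^10 bytes, 1TB = 2^40 bytes); the function name is only illustrative, and the 2TB figure is not derived here.

```python
KIB = 2 ** 10
TIB = 2 ** 40


def max_volume_bytes(cluster_number_bits, cluster_size):
    """Largest volume addressable with the given number of cluster-number bits."""
    return (2 ** cluster_number_bits) * cluster_size


for size_kib in (32, 64):
    total = max_volume_bytes(28, size_kib * KIB)
    print(f"{size_kib}KB clusters -> {total // TIB}TB")
# 32KB clusters -> 8TB
# 64KB clusters -> 16TB
```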

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes that are made to the file system. This assists performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS are therefore that it provides data recovery, data integrity and protection of data in case of interruption during writing.
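
The idea of entering changes into a journal before applying them can be shown with a small, generic Python sketch. It is not the NTFS log format; the class, its methods and the crash_after parameter are invented purely to simulate an interruption.

```python
class JournaledStore:
    """Toy store that logs every change before applying it to 'disk'."""

    def __init__(self):
        self.disk = {}      # stands in for the data on disk
        self.journal = []   # pending changes, assumed to survive a crash

    def log(self, key, value):
        self.journal.append((key, value))

    def commit(self, crash_after=None):
        """Apply logged changes; optionally stop early to simulate a crash."""
        for i, (key, value) in enumerate(self.journal):
            if crash_after is not None and i >= crash_after:
                return False  # interrupted: the journal is left untouched
            self.disk[key] = value
        self.journal.clear()
        return True

    def recover(self):
        """Replay the journal; replaying an already-applied change is harmless."""
        for key, value in self.journal:
            self.disk[key] = value
        self.journal.clear()


store = JournaledStore()
store.log("a", 1)
store.log("b", 2)
store.commit(crash_after=1)   # only the first change reaches the disk
print(store.disk)             # {'a': 1}
store.recover()
print(store.disk)             # {'a': 1, 'b': 2}
```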

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and file size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, each of which is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files than ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. To ensure data integrity, ext4 uses checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
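
A minimal Python sketch of the extent idea follows: one descriptor covers a whole run of contiguous blocks, so a large file needs only a few entries. The Extent fields and the example numbers are assumptions for illustration, not the ext4 on-disk structure. The last lines check the 1EB figure under the additional assumption of 4KB blocks and binary units.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    logical_start: int   # first block of the run, counted within the file
    physical_start: int  # first block of the run on the volume
    length: int          # number of contiguous blocks


def physical_block(extents, logical_block):
    """Map a file-relative block number to a volume block number.

    `extents` must be sorted by logical_start.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i >= 0:
        e = extents[i]
        if logical_block < e.logical_start + e.length:
            return e.physical_start + (logical_block - e.logical_start)
    raise KeyError(f"block {logical_block} is not mapped")


# A 150-block file described by just two extents.
extents = [Extent(0, 5000, 100), Extent(100, 9000, 50)]
print(physical_block(extents, 99))   # 5099
print(physical_block(extents, 120))  # 9020

# 48-bit block numbers with 4KB blocks (assumed) give 2^60 bytes, i.e. 1EB.
max_volume = 2 ** 48 * 4096
print(max_volume == 2 ** 60)         # True
```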

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====Btrfs====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though a lot of the implementation is different. Development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.

Btrfs is based on the b-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: a 64-bit object id, followed by 8 bits denoting the item’s type, and finally 64 bits whose use depends on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
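
The 136-bit key layout described above (64 + 8 + 64 bits) can be modelled by packing the three parts into one Python integer. The function names and the example type numbers are hypothetical; the point is only that comparing the packed values orders items by object id first, then type, then the final 64 bits.

```python
MASK_64 = 2 ** 64 - 1


def pack_key(object_id, item_type, offset):
    """Pack (64-bit object id, 8-bit type, 64-bit type-specific value) into 136 bits."""
    if not (0 <= object_id <= MASK_64 and 0 <= item_type <= 0xFF
            and 0 <= offset <= MASK_64):
        raise ValueError("key component out of range")
    return (object_id << 72) | (item_type << 64) | offset


def unpack_key(key):
    return key >> 72, (key >> 64) & 0xFF, key & MASK_64


# Made-up keys: sorting the packed integers matches sorting the tuples.
parts = [(2, 1, 0), (1, 2, 5), (1, 1, 9)]
ordered = [unpack_key(k) for k in sorted(pack_key(*p) for p in parts)]
print(ordered)  # [(1, 1, 9), (1, 2, 5), (2, 1, 0)]
```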

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

At a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree that holds information about all the allocated extents. Outside of the trees are the extents and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on the same blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write combined with the internal generation number allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage), although now a defunct project as it was cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with an almost identical feature set to ZFS, WinFS was attempting a complete rethink of a file system.

The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.

Unlike traditional file systems, including the cutting-edge ZFS, WinFS was moving away from the hierarchical structure of file systems with regard to search. The file data was still going to reside in a hierarchical structure based on NTFS, but meta-data would be stored in a relational database. This would allow searches that transcended normal file meta-data such as time stamp, name and text content. Meta-data could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, email traffic and contact cards from organizer software.

Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.

====Comparison====
Upon first inspection, Btrfs seems near identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.

WinFS provided an interesting set of features, many of which are completely different from those of ZFS, because it approached the problem differently. While both were attempting to meet the needs of modern computing, ZFS is more focused on server settings and WinFS seemed to be focusing on the requirements of the home PC user. WinFS was not focused on performance or large storage needs, and therefore it would not have been able to serve as a useful replacement for traditional file systems.

These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so it meets a wider needs profile. However, if we are discussing a viable alternative to file systems today, ZFS is currently the best choice.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of a storage pool, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with UFS (the Unix File System). Given the ever-growing need for additional and larger disks, administrators were continuously switching disks, creating partitions and copying old data to the new, larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed via a smartphone or a server. Time wasted on partitions, disk swapping and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D. [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec. 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

* Rector Brent, (2004),[http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting

COMP 3000 Essay 1 2010 Question 9

2010-10-15T11:18:09Z

Naseido: /* WinFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to meet ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it cannot be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage is allocated and freed instead of memory. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
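
The malloc()/free() analogy can be made concrete with a small Python sketch of a pooled allocator. This is only a model of the idea, not the SPA's interface: the class, its method names and the (device, block) tuple standing in for a DVA are all assumptions.

```python
from collections import deque


class StoragePool:
    """Toy pool: callers get opaque addresses and never choose a device."""

    def __init__(self):
        self._free = deque()   # addresses available for allocation
        self._used = set()     # addresses currently handed out

    def add_device(self, name, block_count):
        """Grow the pool while it is in use."""
        self._free.extend((name, block) for block in range(block_count))

    def alloc(self):
        """Hand out one block address, in the spirit of malloc()."""
        if not self._free:
            raise RuntimeError("pool exhausted")
        address = self._free.popleft()
        self._used.add(address)
        return address

    def release(self, address):
        """Return a block to the pool, in the spirit of free()."""
        self._used.remove(address)
        self._free.append(address)

    def free_count(self):
        return len(self._free)


pool = StoragePool()
pool.add_device("disk0", 2)
a = pool.alloc()
b = pool.alloc()              # disk0 is now full
pool.add_device("disk1", 2)   # add storage dynamically
c = pool.alloc()              # served from the new device, caller unchanged
pool.release(a)
print(a, b, c, pool.free_count())  # ('disk0', 0) ('disk0', 1) ('disk1', 0) 2
```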

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]
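
The combination of parent-held checksums and copy-on-write can be sketched as a small immutable tree in Python. This is a simplified model under stated assumptions: SHA-256 stands in for whatever checksum is configured, the Leaf and Indirect classes are invented for illustration, and the top-level Indirect node plays the role of the uberblock.

```python
import hashlib
from dataclasses import dataclass


def checksum(data):
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Leaf:
    data: bytes


@dataclass(frozen=True)
class Indirect:
    children: tuple     # child blocks
    child_sums: tuple   # checksum of each child, stored in the parent


def block_bytes(block):
    """Bytes that a parent checksums for this block."""
    if isinstance(block, Leaf):
        return block.data
    return "".join(block.child_sums).encode()


def make_indirect(children):
    return Indirect(tuple(children),
                    tuple(checksum(block_bytes(c)) for c in children))


def cow_update(root, path, new_data):
    """Return a new root; blocks along `path` are rewritten, the rest are shared."""
    if not path:
        return Leaf(new_data)
    children = list(root.children)
    children[path[0]] = cow_update(children[path[0]], path[1:], new_data)
    return make_indirect(children)   # parent checksums are recomputed on the way up


def verify(block):
    """Check every child against the checksum held by its parent."""
    if isinstance(block, Leaf):
        return True
    return all(checksum(block_bytes(c)) == s and verify(c)
               for c, s in zip(block.children, block.child_sums))


old_root = make_indirect([make_indirect([Leaf(b"a"), Leaf(b"b")]),
                          make_indirect([Leaf(b"c")])])
new_root = cow_update(old_root, [0, 1], b"B")

print(new_root.children[0].children[1].data)          # b'B'
print(old_root.children[0].children[1].data)          # b'b' (old tree untouched)
print(new_root.children[1] is old_root.children[1])   # True (unchanged subtree shared)
print(verify(new_root))                               # True
```

Because the old root still describes a complete, consistent tree, keeping a reference to it is essentially what a snapshot amounts to in this model.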

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
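
A rough Python sketch of recovery through duplicate copies is given below. The BlockPointer class, the dictionary standing in for the disk and the use of SHA-256 are illustrative assumptions; the sketch only shows the principle of trying each copy until one matches the stored checksum, then repairing the others.

```python
import hashlib
from dataclasses import dataclass


def checksum(data):
    return hashlib.sha256(data).digest()


@dataclass
class BlockPointer:
    addresses: list      # locations of the duplicate copies
    expected_sum: bytes  # checksum stored with the pointer, not with the data


def read_and_heal(disk, pointer):
    """Return the first copy that matches the checksum and rewrite bad copies."""
    good = None
    bad = []
    for address in pointer.addresses:
        data = disk.get(address)
        if data is not None and checksum(data) == pointer.expected_sum:
            if good is None:
                good = data
        else:
            bad.append(address)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for address in bad:
        disk[address] = good   # heal the damaged copy from the healthy one
    return good


disk = {1: b"metadata", 2: b"metadata"}
pointer = BlockPointer(addresses=[1, 2], expected_sum=checksum(b"metadata"))

disk[1] = b"metaXata"                # simulate silent corruption of one copy
print(read_and_heal(disk, pointer))  # b'metadata'
print(disk[1])                       # b'metadata' (repaired)
```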

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
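
Block-level deduplication with a hash table can be sketched as follows. The DedupStore class, the tiny 4-byte block size and the use of SHA-256 are assumptions made for illustration; the fold factor is computed here as the ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy store that keeps each unique block once and links files to blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # hash -> block data (physical storage)
        self.files = {}    # file name -> list of block hashes (logical view)

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), self.block_size):
            chunk = data[i:i + self.block_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(digest, chunk)   # store only if not seen before
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name):
        return b"".join(self.blocks[h] for h in self.files[name])

    def fold_factor(self):
        logical = sum(len(self.blocks[h])
                      for hashes in self.files.values() for h in hashes)
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.write("a", b"AAAABBBB")
store.write("b", b"AAAACCCC")   # shares its first block with file "a"

print(store.read("b"))          # b'AAAACCCC'
print(len(store.blocks))        # 3 unique blocks stored for 4 logical blocks
```

Shrinking block_size tends to find more shared blocks but adds more entries to the hash table, which is the trade-off described in the paragraph above.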

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file. The first file on a new device will use all sequential clusters. Therefore, the first cluster will point to the second, which will point to the third and so on. 
The number in the name of a FAT variant, such as FAT32, indicates that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. A problem with larger clusters is that when files are much smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all of the clusters that make up the file, and this takes a long time if the clusters are not stored near each other. When files are deleted, their clusters are freed and become available for new data. Because of this, the clusters of some files end up scattered across the storage device, and accessing such files takes longer. FAT32 does not include a defragmentation mechanism, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments (clusters) of a file so that they are near each other, which decreases the time it takes to access the file. In addition, looking for empty space to store a file requires a linear search through the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table, so that if an operation on the file system is interrupted, the secondary FAT can be used to recover the files.
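
The arithmetic behind the 28-bit cluster numbers and the wasted space in large clusters can be worked through as follows. The function name slack_bytes and the sample file and cluster sizes are assumptions chosen for illustration.

```python
CLUSTER_BITS = 28                 # bits of each 32-bit FAT entry used as a cluster number
max_clusters = 2 ** CLUSTER_BITS  # number of addressable clusters

def slack_bytes(file_size, cluster_size):
    """Bytes allocated to a file but left unused, since files occupy whole clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size - file_size

print(max_clusters)                  # 268435456
print(slack_bytes(100, 4 * 1024))    # 3996
print(slack_bytes(100, 32 * 1024))   # 32668
```

The same 100-byte file wastes roughly eight times as much space with 32KB clusters as with 4KB clusters, which is the trade-off against being able to address a larger volume.
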

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing what is unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.
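
As a rough sketch of how the superblock fields listed above are used, the following Python computes the number of block groups and locates an inode's block group. The Superblock class and its field names are assumptions made for this example (they paraphrase the fields named in the text), and the sample values are invented.

```python
from dataclasses import dataclass

@dataclass
class Superblock:
    # Hypothetical field names mirroring the superblock contents described above.
    block_size: int
    total_blocks: int
    blocks_per_group: int
    first_data_block: int   # reserved blocks before the first block group
    total_inodes: int
    inodes_per_group: int

def group_count(sb):
    """Number of block groups needed to cover all blocks after the reserved ones."""
    data_blocks = sb.total_blocks - sb.first_data_block
    return -(-data_blocks // sb.blocks_per_group)  # ceiling division

def inode_location(sb, inode_number):
    """Inodes are numbered from 1; return (block group, index within that group)."""
    index = inode_number - 1
    return index // sb.inodes_per_group, index % sb.inodes_per_group

sb = Superblock(block_size=1024, total_blocks=20000, blocks_per_group=8192,
                first_data_block=1, total_inodes=6144, inodes_per_group=2048)
print(group_count(sb))           # 3
print(inode_location(sb, 2))     # (0, 1)
print(inode_location(sb, 2049))  # (1, 0)
```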

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources, since it does not have to systematically go through a storage device. FAT32 and ext2 each manage just a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
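
The journaling idea described above (log the change first, then apply it) can be illustrated with a toy model. This is a generic write-ahead sketch, not NTFS's actual log format; the JournaledStore class, its methods and the crash_before_apply flag are all hypothetical names invented for the example.

```python
class JournaledStore:
    """Toy journaling model: every change is logged before it is applied."""

    def __init__(self):
        self.data = {}     # stands in for the on-disk file system structures
        self.journal = []  # stands in for the on-disk journal

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))  # 1. record the intended change
        if crash_before_apply:
            return                         # simulated interruption mid-write
        self.data[key] = value             # 2. apply the change
        self.journal.pop()                 # 3. retire the completed journal entry

    def recover(self):
        """After an interruption, replay changes that were logged but not applied."""
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()

store = JournaledStore()
store.write("a.txt", "first")
store.write("b.txt", "second", crash_before_apply=True)
print(store.data)   # {'a.txt': 'first'}
store.recover()
print(store.data)   # {'a.txt': 'first', 'b.txt': 'second'}
```

Because the interrupted change is still in the journal, recovery only has to look at the journal rather than scan the whole volume.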

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
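
The size limits and the extent idea above can be checked with a short sketch. It assumes a 4KB block size (which is what makes the 16TB and 1EB figures work out, using binary units), and the Extent tuple and physical_block function are illustrative names rather than ext4's on-disk structures.

```python
from typing import NamedTuple

BLOCK_SIZE = 4 * 1024  # assumed 4KB block size
TIB = 2 ** 40
EIB = 2 ** 60

def max_volume_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest volume addressable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size

print(max_volume_bytes(32) // TIB)  # 16  (ext3: 32-bit block numbers)
print(max_volume_bytes(48) // EIB)  # 1   (ext4: 48-bit block numbers)

class Extent(NamedTuple):
    logical_start: int   # first block of the range within the file
    physical_start: int  # first block of the range on disk
    length: int          # number of contiguous blocks

def physical_block(extents, logical_block):
    """Map a file-relative block number to a disk block via the extent list."""
    for e in extents:
        if e.logical_start <= logical_block < e.logical_start + e.length:
            return e.physical_start + (logical_block - e.logical_start)
    return None  # block not allocated

extents = [Extent(0, 1000, 8), Extent(8, 5000, 4)]
print(physical_block(extents, 3))  # 1003
print(physical_block(extents, 9))  # 5001
```

Two extents describe twelve blocks here, whereas a per-block mapping would need twelve entries; the saving grows with file size.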

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume size of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system size of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with motivations similar to those of ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====Btrfs====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though a lot of the implementation is different. Its development started just three years after that of ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The data stored in these trees is referred to as items, and all items are sorted on a 136-bit key. The key is divided into three chunks. The first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
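
The 64 + 8 + 64 bit layout of the key can be made concrete with a small sketch. The Key tuple and the pack and unpack helpers are names invented for this example, and the sample key values are arbitrary; the point is only that the three fields form one 136-bit value whose numeric order matches field-by-field order.

```python
from typing import NamedTuple

class Key(NamedTuple):
    objectid: int   # 64 bits
    item_type: int  # 8 bits
    offset: int     # 64 bits, meaning depends on the item type

def pack(key):
    """Combine the three fields into a single 136-bit integer."""
    return (key.objectid << 72) | (key.item_type << 64) | key.offset

def unpack(value):
    """Split a 136-bit integer back into its three fields."""
    return Key(value >> 72, (value >> 64) & 0xFF, value & (2 ** 64 - 1))

largest = Key(2 ** 64 - 1, 0xFF, 2 ** 64 - 1)
print(pack(largest).bit_length())  # 136

keys = [Key(257, 1, 0), Key(256, 84, 5), Key(256, 1, 0)]
print(sorted(keys) == sorted(keys, key=pack))  # True
```

Sorting on the packed value groups everything belonging to one object id together, then orders it by item type and offset.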

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

As a minimum, a Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree that holds information about all the allocated extents. Outside of the trees are the extents and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur on the same blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write combined with the internal generation number allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage), although now a defunct project as it was cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS was attempting a complete rethink of the file system.

The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.

Unlike traditional file systems, including the cutting-edge ZFS, WinFS was moving away from relying on the hierarchical structure of the file system for search. The file data was still going to reside in a hierarchical structure based on NTFS, but meta-data was stored in a relational database. This would allow searches that transcended normal file meta-data such as time stamp, name and text content. Meta-data could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, email traffic and contact cards from organizer software.

Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.

====Comparison====
Upon first inspection, Btrfs currently seems nearly identical to ZFS; however, Btrfs lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

WinFS, while vapourware, did provide an interesting set of features, many of which are completely different from those of ZFS. One reason for this is that, while both were attempting to meet the needs of modern computing, ZFS is more focused on server settings and WinFS seemed to be focusing on the requirements of the home PC user.

These two file systems show that, while ZFS leads today, it could be supplanted. Btrfs is close to matching its feature set, at which point the comparison will come down solely to speed. WinFS proposed a different way of storing and thinking about data, and some newcomer may pick up the torch.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System). Given the ever-growing need for additional and larger disks, and with administrators continuously switching disks, partitioning them, and copying old data to the new larger disks, there was a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, given the increasing demand not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smart-phone or a server, the time wasted on partitioning, disk swapping and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it hasn't already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

* Rector Brent, (2004),[http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting

COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:31:13Z

Naseido: /* BTRFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems, even when it is not used.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
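
The malloc()/free() analogy can be illustrated with a toy pool. This is a sketch under simplifying assumptions, not the SPA's real interface: the StoragePool class, its method names, and the use of a (device, block) pair as a stand-in for a DVA are all invented for the example.

```python
class StoragePool:
    """Toy pooled storage: callers get opaque block addresses, not disks or partitions."""

    def __init__(self):
        self.free_blocks = []  # addresses available for allocation

    def add_device(self, name, num_blocks):
        """Grow the pool dynamically by adding every block of a new device."""
        self.free_blocks.extend((name, b) for b in range(num_blocks))

    def alloc(self):
        """Like malloc(): hand out one free block from anywhere in the pool."""
        if not self.free_blocks:
            raise RuntimeError("pool exhausted")
        return self.free_blocks.pop(0)

    def free(self, address):
        """Like free(): return a block to the pool for reuse."""
        self.free_blocks.append(address)

pool = StoragePool()
pool.add_device("disk0", 2)
a = pool.alloc()
b = pool.alloc()
pool.add_device("disk1", 1)   # no repartitioning needed to use the new disk
c = pool.alloc()
print(a, b, c)                # ('disk0', 0) ('disk0', 1) ('disk1', 0)
```

The caller never chooses a device; any file system drawing on the pool can use space from any member disk.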

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.
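
One way to picture a virtual device as a node whose children are either devices or other virtual devices is the sketch below. The Disk and Mirror classes and their read and write methods are hypothetical and greatly simplified; they only show how mirroring can be added as one more node type behind a common interface.

```python
class Disk:
    """Leaf node: stands in for a device driver, storing blocks in a dictionary."""
    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks.get(addr)

class Mirror:
    """Interior node: implements mirroring on top of any children with read/write."""
    def __init__(self, children):
        self.children = children

    def write(self, addr, data):
        for child in self.children:   # every child receives a copy
            child.write(addr, data)

    def read(self, addr):
        for child in self.children:   # first child that still has the block wins
            data = child.read(addr)
            if data is not None:
                return data
        return None

d0, d1 = Disk(), Disk()
mirror = Mirror([d0, d1])
mirror.write(7, b"payload")
d0.blocks.clear()                     # simulate losing one disk's contents
print(mirror.read(7))                 # b'payload'
```

Since Mirror exposes the same interface as Disk, a Mirror can itself be a child of another virtual device.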

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easiest to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained in the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure, since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
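
The interplay of checksums and ditto blocks can be sketched as follows. This is a simplified model, not ZFS code: the disk is a dictionary, SHA-256 is used here simply as a convenient checksum, and the BlockPointer class and read_and_heal function are names made up for the example.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class BlockPointer:
    """Held by the parent: the expected checksum plus the addresses of all copies."""
    def __init__(self, addresses, data_checksum):
        self.addresses = addresses
        self.checksum = data_checksum

def read_and_heal(disk, ptr):
    """Return verified data, overwriting any copy that fails its checksum."""
    good = None
    for addr in ptr.addresses:
        if checksum(disk[addr]) == ptr.checksum:
            good = disk[addr]
            break
    if good is None:
        raise IOError("all copies failed verification")
    for addr in ptr.addresses:
        if checksum(disk[addr]) != ptr.checksum:
            disk[addr] = good  # repair the bad copy from the healthy one
    return good

disk = {0: b"metadata", 1: b"metadata"}          # two ditto copies
ptr = BlockPointer([0, 1], checksum(b"metadata"))
disk[0] = b"metadatA"                            # silent corruption of one copy
print(read_and_heal(disk, ptr))                  # b'metadata'
print(disk[0])                                   # b'metadata'
```

Because the checksum lives in the pointer rather than next to the data, corrupting a block does not also corrupt the value it is checked against.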

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
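
The copy-on-write update can be modelled with nested dictionaries standing in for the on-disk tree. This is only a sketch under that assumption: cow_update is a hypothetical helper, and rebinding a single variable stands in for the one atomic write that connects the new structures.

```python
def cow_update(root, path, value):
    """Return a new tree root with `value` at `path`; the old tree is left untouched."""
    if not path:
        return value
    new_node = dict(root)  # copy this node instead of modifying it in place
    new_node[path[0]] = cow_update(root.get(path[0], {}), path[1:], value)
    return new_node

old_root = {"home": {"a.txt": "v1", "b.txt": "unchanged"}}

# New structures are built off to the side, sharing every untouched subtree.
new_root = cow_update(old_root, ["home", "a.txt"], "v2")

# The single "atomic write": switch the root pointer to the new tree.
current_root = new_root

print(current_root["home"]["a.txt"])  # v2
print(old_root["home"]["a.txt"])      # v1
```

A crash before the root switch leaves the old, consistent tree in place, and keeping a reference to old_root afterwards behaves like a snapshot of the earlier state.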

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
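
A block-level, hash-table-based scheme of the kind described above might look like the following minimal sketch. The DedupStore class, the tiny 4-byte block size, the choice of SHA-256, and expressing the fold factor as a ratio of logical to physical bytes are all assumptions made for the illustration.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplication: each unique block is stored once, keyed by hash."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # hash -> unique block data (the hash table)
        self.files = {}   # file name -> ordered list of block hashes

    def put(self, name, data: bytes):
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            hashes.append(digest)
        self.files[name] = hashes

    def get(self, name) -> bytes:
        """Reconstruct a file by following its links to the shared blocks."""
        return b"".join(self.blocks[h] for h in self.files[name])

    def fold_factor(self):
        """Logical bytes provided divided by physical bytes stored."""
        logical = sum(len(self.blocks[h]) for hs in self.files.values() for h in hs)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical

store = DedupStore()
store.put("a", b"AAAABBBB")
store.put("b", b"AAAACCCC")
print(len(store.blocks))        # 3
print(store.get("b"))           # b'AAAACCCC'
```

The two files share the block AAAA, so 16 logical bytes occupy 12 physical bytes; a smaller block size would find more matches at the cost of a larger hash table.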

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and far smaller storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes each). Originally, a sector held the data of one file, and larger files were stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors hold that file's data. Because a sector is small compared with many files, recording every sector individually would take a significant amount of time and memory. To reduce this bookkeeping, the FAT file system uses clusters, which are fixed-size groups of sectors; each cluster belongs to at most one file. One drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in it. The name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked-list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster of the file, and so on until the entry for the file's last cluster. The first file on a new device will use sequential clusters: the first cluster points to the second, which points to the third, and so on.
The number in the name of a FAT variant, such as the 32 in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. Larger clusters have a cost: when files are much smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all of the clusters that make up the file, which takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. As a result, a file's clusters may end up scattered across the storage device, and the file takes longer to access. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a default function of FAT32, looking for empty space in which to store a file requires a linear search through all the clusters. The lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are therefore all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to recover the files.
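
The linked-list behaviour of the table, the linear search for free space, and the wasted space in partly filled clusters can be sketched in a few lines. This is a simplified model under stated assumptions: the table is a plain Python list indexed by cluster number, the helper names are invented for the example, and the capacity figure uses the full 2^28 cluster count quoted above (a real volume has slightly fewer usable clusters because some values are reserved).

```python
FAT32_MASK = 0x0FFFFFFF     # low 28 bits of each 32-bit entry hold the cluster number
FAT32_EOC_MIN = 0x0FFFFFF8  # masked values at or above this mark the end of a chain
FREE = 0                    # a masked value of 0 marks a free cluster


def cluster_chain(fat, first_cluster):
    """Follow the linked list in the table, starting at a file's first cluster."""
    chain, seen = [], set()
    cluster = first_cluster
    while True:
        if cluster < 2 or cluster in seen:  # clusters 0 and 1 are reserved
            raise ValueError("corrupt chain at cluster %d" % cluster)
        seen.add(cluster)
        chain.append(cluster)
        nxt = fat[cluster] & FAT32_MASK
        if nxt >= FAT32_EOC_MIN:
            return chain
        cluster = nxt


def first_free_cluster(fat):
    """Linear search for a free cluster, as described in the text."""
    for number in range(2, len(fat)):
        if fat[number] & FAT32_MASK == FREE:
            return number
    return None


def slack_bytes(file_size, cluster_size):
    """Bytes wasted in the last cluster of a file."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size - file_size


def max_data_bytes(cluster_size, cluster_bits=28):
    """Upper bound on addressable data for a given cluster size."""
    return (2 ** cluster_bits) * cluster_size


# Hypothetical 10-entry table: one contiguous file and one fragmented file.
fat = [0] * 10
fat[2], fat[3], fat[4] = 3, 4, 0x0FFFFFFF  # file A: 2 -> 3 -> 4
fat[5], fat[7], fat[6] = 7, 6, 0x0FFFFFFF  # file B: 5 -> 7 -> 6 (out of order)
```

In this model a 1000-byte file in a 32KB cluster wastes 31768 bytes, and 2^28 clusters of 32KB or 64KB give the 8TB and 16TB figures that appear in the comparison below.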

==== Ext2 ====
The ext2 file system (second extended file system) was designed after UFS (Unix File System) and attempts to mimic certain functionality of UFS while removing what was unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defined how file fragments were organized, and it was important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.
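
As a rough illustration of how the superblock fields listed above are used, the sketch below computes the number of block groups and finds which block group holds a given inode. The class and field names are chosen for the example rather than copied from the ext2 headers, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    block_size: int        # bytes per block
    blocks_count: int      # total number of blocks
    blocks_per_group: int  # blocks in each block group
    first_data_block: int  # reserved blocks before the first block group
    inodes_count: int      # total number of inodes
    inodes_per_group: int  # inodes in each block group


def group_count(sb):
    """Number of block groups needed to cover all blocks after the reserved ones."""
    data_blocks = sb.blocks_count - sb.first_data_block
    return -(-data_blocks // sb.blocks_per_group)  # ceiling division


def locate_inode(sb, inode_number):
    """Return (block group, index within that group) for an inode number.

    Inode numbers start at 1, so the number is shifted down by one first.
    """
    if not 1 <= inode_number <= sb.inodes_count:
        raise ValueError("inode number out of range")
    index = inode_number - 1
    return index // sb.inodes_per_group, index % sb.inodes_per_group


# Hypothetical small volume: 1KB blocks, three block groups.
sb = Superblock(block_size=1024, blocks_count=24577, blocks_per_group=8192,
                first_data_block=1, inodes_count=6144, inodes_per_group=2048)
```

Here `locate_inode(sb, 2049)` lands on the first inode of the second block group, which is the kind of lookup a checker such as fsck performs repeatedly while walking the inodes.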

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS can find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources, since it does not have to go systematically through a storage device. FAT32 and ext2 each manage just a single storage device, whereas ZFS uses a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error in the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system. This assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
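
The idea of entering a change into the journal before making it can be shown with a toy model. This is a generic redo-only sketch, not the NTFS log format: the class, its dictionary "disk", and the `crash_before_apply` flag are all invented for the illustration.

```python
class JournaledStore:
    """Toy write-ahead journal: record the intended change, then apply it."""

    def __init__(self):
        self.disk = {}     # simulated on-disk state
        self.journal = []  # simulated journal records, oldest first

    def write(self, key, value, crash_before_apply=False):
        # Step 1: the intended change goes into the journal first.
        record = {"key": key, "value": value, "applied": False}
        self.journal.append(record)
        if crash_before_apply:
            return  # simulate an interruption before the change reaches the disk
        # Step 2: only then is the change made to the real data.
        self._apply(record)

    def _apply(self, record):
        self.disk[record["key"]] = record["value"]
        record["applied"] = True

    def recover(self):
        """Replay journal records that never reached the disk."""
        replayed = 0
        for record in self.journal:
            if not record["applied"]:
                self._apply(record)
                replayed += 1
        return replayed


store = JournaledStore()
store.write("a", 1)
store.write("b", 2, crash_before_apply=True)  # interrupted write
missing_before_recovery = "b" not in store.disk
replayed = store.recover()
```

After recovery the interrupted write is present, without any scan of the whole volume; only the journal had to be read.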

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system. It records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
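
A short sketch may help show why an extent is cheaper than a per-block list, and where the 16TB and 1EB limits come from. The `Extent` class and helper names are invented for this example, and the 4KB block size is an assumption (the figures above only work out for that block size).

```python
from dataclasses import dataclass


@dataclass
class Extent:
    logical_start: int   # first block of the run, counted within the file
    physical_start: int  # first block of the run, counted on the volume
    length: int          # number of contiguous blocks in the run


def map_block(extents, logical_block):
    """Translate a file-relative block number to a volume block number."""
    for extent in extents:
        offset = logical_block - extent.logical_start
        if 0 <= offset < extent.length:
            return extent.physical_start + offset
    return None  # block not covered by any extent


def max_volume_bytes(block_number_bits, block_size=4096):
    """Largest volume addressable with the given block-number width."""
    return (2 ** block_number_bits) * block_size


# A hypothetical 150-block file described by just two extents.
extents = [Extent(0, 1000, 100), Extent(100, 5000, 50)]
```

Two descriptors cover all 150 blocks here, where a per-block scheme would need 150 pointers. With 4KB blocks, 32-bit block numbers give 2^44 bytes (16TB) and 48-bit block numbers give 2^60 bytes (1EB).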

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.
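
The general idea behind self-healing can be illustrated with a toy mirrored read: each copy is checked against a stored checksum, and a bad copy is overwritten from a good one while the data is being served. This is only a sketch of the concept and not ZFS's actual code or on-disk layout; the function names and the use of SHA-256 are assumptions.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def self_healing_read(copies, expected):
    """Return valid data from a list of mirrored copies, repairing bad copies.

    `copies` is a mutable list of byte strings; `expected` is the checksum
    recorded when the data was written.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    repaired = 0
    for index, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[index] = good  # overwrite the damaged copy in place
            repaired += 1
    return good, repaired


# One copy of a hypothetical block has been silently corrupted.
original = b"hello"
mirror = [b"hellx", b"hello"]
data, repaired = self_healing_read(mirror, checksum(original))
```

The read succeeds and the damaged copy is fixed as a side effect, so no separate offline scan is needed to find it.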

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: a 64-bit object id, followed by 8 bits denoting the item’s type, and finally 64 bits whose use depends on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
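
The 64 + 8 + 64 bit layout of the key can be made concrete by packing the three fields into one 136-bit integer. The function names, the name `offset` for the last field, and the sample type values are placeholders chosen for this sketch; it models only the ordering, not the on-disk byte layout.

```python
def pack_key(object_id, item_type, offset):
    """Combine the three fields into a single 136-bit integer.

    The object id occupies the most significant bits, so comparing packed
    keys orders items by object id, then type, then the last field.
    """
    if not (0 <= object_id < 2 ** 64
            and 0 <= item_type < 2 ** 8
            and 0 <= offset < 2 ** 64):
        raise ValueError("field out of range")
    return (object_id << 72) | (item_type << 64) | offset


def unpack_key(key):
    """Split a packed key back into (object id, type, last field)."""
    return key >> 72, (key >> 64) & 0xFF, key & (2 ** 64 - 1)


# Placeholder items: (object id, type, type-dependent value).
keys = [(256, 84, 5), (256, 1, 0), (5, 108, 4096), (256, 84, 2)]
packed_sorted = sorted(pack_key(*k) for k in keys)
```

Sorting the packed integers gives the same order as sorting the (object id, type, value) tuples, which is what keeps all the items belonging to one object next to each other in the tree.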

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

At a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the roots of the other trees. Second, it contains a subvolumes tree, which holds files and directories. Third, it contains an extents volume tree, which holds information about all the allocated extents. Outside of the trees are the extents and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur on the same blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy on write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.
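
How copy on write makes snapshots cheap can be shown with a toy block store. This is a sketch of the general technique, not of Btrfs internals: the class is invented for the example, the "tree" is a flat dictionary, and `zlib.crc32` stands in for whatever checksum the real file system uses.

```python
import zlib


class CowStore:
    """Toy copy-on-write store: a write never overwrites an existing block."""

    def __init__(self):
        self.blocks = {}     # block id -> (data, checksum, generation)
        self.next_id = 0
        self.generation = 0
        self.root = {}       # file name -> block id (the current view)

    def write(self, name, data):
        self.generation += 1
        block_id = self.next_id
        self.next_id += 1
        # New data always goes to a freshly allocated block.
        self.blocks[block_id] = (data, zlib.crc32(data), self.generation)
        # A new root replaces the old one; the old root is left untouched.
        new_root = dict(self.root)
        new_root[name] = block_id
        self.root = new_root

    def snapshot(self):
        """A snapshot is just a saved root; no data blocks are copied."""
        return dict(self.root)

    def read(self, name, root=None):
        root = self.root if root is None else root
        data, stored, _generation = self.blocks[root[name]]
        if zlib.crc32(data) != stored:
            raise IOError("checksum mismatch")
        return data


store = CowStore()
store.write("f", b"v1")
snap = store.snapshot()
store.write("f", b"v2")
```

The snapshot still reads the old contents because the old block was never modified; freeing it is deferred until no root refers to it any longer (not modelled here).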

Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage) is a now-defunct project; it closed its doors in 2006. It was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.

The two most notable features of WinFS was it's

WinFS, while vapourware, did provide an interesting set of features, many of which are completely different from those of ZFS. One reason for this is that while both were attempting to meet the needs of modern computing, ZFS is more focused on server settings and WinFS seemed to focus on the requirements of the home PC user.

====Comparison====
Btrfs, upon first inspection, currently seems nearly identical to ZFS; however, Btrfs lacks a couple of features that ZFS has. Btrfs does not have the self-healing capability or the data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix file system). Given the ever-growing need for additional and larger disks, and the problems with administrators continuously switching disks, creating partitions and copying old data to the new larger disks, there was a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, given the increasing demand not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smartphone or a server, time wasted on partitioning, disk swapping and similar activities, and any inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

Talk:COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:29:52Z

Naseido: /* Deadline */

== Contacts / If interested ==
Tawfic : tfatah@gmail.com

Andy Zemancik: andy.zemancik@gmail.com

Lester Mundt: lmundt@gmail.com

Matthew Chou : mateh.cc@gmail.com (this is mchou2)

Nisrin Abou-Seido: naseido@connect.carleton.ca

== Suggested References Format ==
Author, publisher/university, Name of the article

== Who is doing what ==
Suggestion: In order to avoid duplication, please state what section/item you're currently working on.

Azemanci: Currently working on Section Three Current File Systems.

Nisrin (naseido): Working on intro, conclusion and editing

Lester: Working on section 3 BTRFS and WinFS

Tawfic: added a conclusion. Not planning to add anything else. It's sleep time !!

KEY ISSUE: We need a thesis statement. Please suggest ideas.

--[[User:Tafatah|Tafatah]] 01:38, 15 October 2010 (UTC) I think the thesis statement is implied in the current intro, i.e. ".. of avoiding some of the major problems associated
with traditional file systems . . " suggests that it's unacceptable anymore to tolerate problems in today's IT environment. If
you'd like to expand on it, you could mention the continuous need for flexibility vis-a-vis using data. Example: there's a growing
trend with cloud computing (the marketable name for distributed computing). Users who will opt for that option will have to trust
the host companies with their data. It won't be acceptable to them to be told that a file here or a directory there was lost. The
issue with the growth of smart phones and yet to proliferate tablets also increases the demands for flexibility ( need to be able
to add/shrink/manage storage on the fly and avoid manual intervention when problems arise) . . etc. Hope that helps.

The conclusion would have to assert the main points regarding ZFS, i.e. its modularization and administrative simplicity, its
virtualization of storage via the use of SPA and DMU, and its self healing abilities. I am currently working on the section
that talks ( briefly ) about ZFS's self-healing ( for lack of better words ). So I'll be online for sometime. The info on SPA
and DMU is already there. If it's not clear enough, please let me know

--[[User:Lmundt|Lmundt]] 03:05, 15 October 2010 (UTC) I agree completely with the thesis and mentioned this at the bottom of the page.

Something along the lines of "ZFS is a file system designed to support the changing requirements of computing" as I had mentioned before "server needs was of particular attention" then continue to expand by describing the environment that is causing these changes "more companies with distributed large scale data storage "the cloud" " as Tawfic has suggested this establishes motivation.

Then we mention the goals that were desired extensibility( accomplished through modularization ), reliability ( checksums, copy on write ) and maintainability ( administrative simplicity)

In the intro to ZFS we talk about how its feature set supports the design goals. Tawfic has then done most of the feature descriptions.

The section on legacy filesystems should have a mini-intro talking about the state of the environment they were designed for and their goals of the time.

Describe then which has been done.

Small contrast and compare with ZFS generally summarizing along the lines of "gee that sure is better"

Repeat for the other two sections current and future with each contrast/compare getting larger since they are more comparable and with different conclusions.

--[[User:Lmundt|Lmundt]] 07:34, 15 October 2010 (UTC) Personally I am not certain if an example should be in the conclusion.... I like the opening lines though. Kind of disregarding the rest of the essay though I think.

--[[User:Lmundt|Lmundt]] 08:15, 15 October 2010 (UTC)
I think that conclusion is looking more focussed great job.

== Deadline ==
Suggestion: Adding content should stop on Thursday, October 14'th at 3:00 PM. Any work after that
should go into formatting, spelling, and grammar checking.

--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)
- I will definitely be adding content after this time probably late, late into the evening.

--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we'd have to be done by 11:00Pm.
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 Am or so. So anytime
before that is Ok.

--[[User:Lmundt|Lmundt]] 07:22, 15 October 2010 (UTC)
I wish I could have got this started earlier but 4104 had a crazy assignment that destroyed me since Saturday.

--[[User:Naseido|Naseido]] 4:25 EST, 15 October 2010
Okay. So, as of now everything has been edited for grammar and spelling and format and I think it's looking good. Only issue is someone is adding to the WinFS section right now so I can't edit it. My suggestion is to '''scrap that section''' because it's not even really related (i.e. it's not a traditional file system). Anyway, if you want to go on with it go ahead but make sure you edit for grammar, spelling.. and also edit the conclusion. Unfortunately, I'm way too tired to stay up and wait till it's done to do the editing myself.. sorry.

== Essay Format Take 2 ==
Hello. I am suggesting the following format instead. If you agree, I'll take care of merging the existing info into this new format. My feeling is that this format is
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.

* '''Abstract'''
TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly
growing needs of dynamic storage.

TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer
demand . . etc )

TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more
intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.

* '''Traditional File Systems'''
** '''Characteristics'''
** '''Limitations'''

* '''Zettabyte File System'''
** '''Characteristics'''
** '''Dissected'''
TO-DO: List the seven components of ZFS and basically what makes a ZFS
E.g. interface, various parts, and external needed libraries . . etc.

** '''Features Beyond Traditional File Systems'''

** '''Possible Real-Life Scenarios / Examples'''
TO-DO: 2-3 examples where ZFS was/could/is being considered for use.

TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract
(thesis statement).

* '''Alternatives to ZFS'''
one example is good enough.
TO-DO: a brief description of the alternative.
Main argument for its viability.

** '''Pros/Cons'''
TO-DO: just a list of pluses and minuses

TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional
FS’s are no longer viable, and stressing once more that ZFS is a valid alternative.

== Essay Format ==

I started working on the main page. The bullets are to be expanded. Other groups are working in their respective discussion pages but I think it's all right to put our work in progress on the front page. Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt; what do you think of listing the capacities of the file system under major features? I was thinking that we could overview the features in brief, then delve into each one individually.
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what you're suggesting in one section. So here is the structure I am thinking of.

* Intro
* Section One ZFS
** Major feature 1
** Major feature 2
** Major feature 3
* Section Two Legacy File Systems
** Legacy File System1( FAT32 ) - what it does
** Legacy File System2( ext2 ) - what it does
** Contrast them with ZFS
* Section Three Current File Systems
** NTFS?
** ext4?
** Contrast them with ZFS
* Section Four future file Systems
** BTRFS
** WinFS or ??
** Contrast them with ZFS
* Conclusion

What does everyone think of this format? While everyone should contribute to section one we could divvy up the rest.

[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming its own essay just on deduplication.

The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.

The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than a full end-to-end solution.

--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC) The section on deduplication looks good. I agree it looks like the right amount of coverage for a portion of a single section. You're also right about the old file systems not being able to hold a candle to ZFS, and the conclusion section should talk about how storage needs and computers changed. An intro to that section could set the stage for that period as well: non-multi-threaded, single-processor systems with much smaller RAM; even the applications were radically different, and the Internet was just single webpages without the high performance needs of web commerce and online banking, for example. I have another assignment so won't be contributing too much until Monday.

--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)

--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)
I think I prefer the outline I proposed only because it's a very regimented contrast/compare essay format and should get us any marks for format. Most proper essays don't usually have a dedicated pros cons list. Heading more towards a report format I think. It's really what everyone agrees on. I won't be touching the essay until tomorrow though.

--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)
I like Lmundt's outline. How would you like to divide up the work? Also can everyone post the contact information so we know exactly who is in our group.

--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)
No problem, I'll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that
if we keep these points in mind. I started looking into the "dissect subsection" in the format I suggested, which is related to the ZFS features
section one in the current format. I'll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I'll
take care of section one since I've already done some work on it. I suggest that each member of the group picks two items from one of the other
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the
different pieces.

--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)
I will begin working on section three current File Systems unless someone else has already begun working on it.

--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)
I am going to start researching for section 2.

--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC) Alright so all the sections are being taken care of so we should be good to go for Thursday.

--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) '''No one is assigned to section four''' ? Also, for those who haven't picked any section or subsection, please help out with the sections you're
more familiar with.

Finally, if you were in class today (well, technically yesterday), then you've heard Anil talk about plagiarism. I know this is common knowledge, so forgive
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is
simply impossible for the rest of the group to check whether every member's sentence is genuine or not. So use your own words/phrases ( doesn't
have to be fancy or sophisticated ). If you're not sure, please check with the rest of the group.

Good luck, and good night.
--Tawfic

--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC) My bad I misread something I thought you were doing current file systems section 3. I'll take section 3 but then someone needs to do section 4. There are 4 of us so this should not be a problem.

--[[User:Naseido|Naseido]] 13 October 2010 Sorry I haven't contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I'll be doing section 4.

--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.

--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)
I have done what I can for the legacy file systems, if someone who doesn't have any particular job wouldn't mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.

--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing

--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC) I'm not going to have my info posted by 3:00. Also how and where are we supposed to cite our sources?

--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have till 7:00 Am ( or till Anil locks the Wiki down, though I wouldn't count on more than 7) Friday, Oct 15. For citing, I am using
this convention. Bla.....Bla [Z1. P3] means I am using info from page 3 of article labeled as Z1 in references section.

--[[User:Lmundt|Lmundt]] 20:48, 14 October 2010 (UTC) Citation reference looks good. I think in each section we should talk about the motivation behind the design, with a small intro for each section. For example, in the legacy area (FAT32 and ext), we could talk about computing becoming commonplace for the average worker, so most development was for single users with a relatively small hard drive and not too many files. That will in turn lead to a justification/explanation for the design decisions for each vintage of OS.
The main intro should be about the motivations behind the design: intended for servers, with a focus on expandability, reliability and self maintenance. This is the motivation behind all those cool features we are detailing.

== Sources ==

Not from your group. Found a file which goes to the heart of your problem
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-149902.pdf ZFSDatasheet]
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)

Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)

[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:
* PDF: http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&hl=en&as_sdt=2000

Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)

Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)

the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)

Hey, I'm not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires

------

Hey there. I'm not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations

-- Munther

--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf

--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)
A bit late, but I found a comparison of current file systems including ZFS:
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf

http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:23:53Z

Naseido: /* BTRFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it addresses possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by ZFS, as the requirements that it satisfies are not fully implemented by any other single system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is not used.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage: since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
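
The malloc()/free() analogy can be illustrated with a toy allocator. This is only a sketch and not the SPA's real interface: the class and method names (StoragePool, add_device, alloc, free) and the integer addresses are assumptions made for illustration.

```python
# Toy storage pool. All names here are hypothetical, not ZFS APIs.
class StoragePool:
    """Hands out blocks by virtual address, hiding which device backs them."""

    def __init__(self):
        self.free_blocks = []   # (device, block index) pairs not yet allocated
        self.mapping = {}       # virtual address -> (device, block index)
        self.next_dva = 0       # next virtual address to hand out

    def add_device(self, name, num_blocks):
        # Growing the pool only extends the free list; callers are unaffected.
        self.free_blocks.extend((name, i) for i in range(num_blocks))

    def alloc(self):
        # Analogous to malloc(): returns an opaque address, not a disk location.
        if not self.free_blocks:
            raise MemoryError("pool exhausted")
        location = self.free_blocks.pop(0)
        dva = self.next_dva
        self.next_dva += 1
        self.mapping[dva] = location
        return dva

    def free(self, dva):
        # Analogous to free(): the physical block returns to the shared pool.
        self.free_blocks.append(self.mapping.pop(dva))


pool = StoragePool()
pool.add_device("disk0", 2)
a = pool.alloc()
b = pool.alloc()
pool.add_device("disk1", 2)   # storage added while the pool is in use
c = pool.alloc()              # lands on the new device, caller cannot tell
pool.free(a)
```

The caller only ever sees the values a, b and c; which disk backs each one is known only to the pool, which is what lets devices be added without touching the callers.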

Virtual devices abstract device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring, and it accomplishes such tasks via virtual devices, each of which implements a specific task. For example, if the SPA needs to handle mirroring, a virtual device is written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.
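
The tree-of-devices idea can be sketched as follows. The Disk and Mirror classes are hypothetical stand-ins (a dictionary plays the role of a device driver), so this shows the shape of the design rather than ZFS code.

```python
class Disk:
    """Leaf node standing in for a device driver (hypothetical)."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks[addr]   # raises KeyError if the block is missing


class Mirror:
    """Interior virtual device: same read/write interface as a Disk."""

    def __init__(self, children):
        self.children = children

    def write(self, addr, data):
        # Every child receives a copy of the data.
        for child in self.children:
            child.write(addr, data)

    def read(self, addr):
        # Return the first copy that can be read.
        for child in self.children:
            try:
                return child.read(addr)
            except KeyError:
                continue
        raise KeyError(addr)


d0, d1 = Disk(), Disk()
root = Mirror([d0, d1])
root.write(7, b"hello")
del d0.blocks[7]             # simulate losing one copy
recovered = root.read(7)     # still served from the second child
```

Because Mirror exposes the same interface as Disk, a mirror can itself be a child of another virtual device, which is the composability the paragraph above describes.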

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This very large per-object capacity allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by its parent indirect block. Keeping the checksum and the data in separate places reduces the probability of simultaneous corruption. If a write fails for whatever reason, the failure can be detected when walking down from the uberblock, since each parent holds the checksum of its child and a corrupted block will not match it. ZFS is then able to retrieve a backup from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated and the modified contents are written there rather than over the old block. Any pointers and/or indirect blocks are then modified in the same way, traversing all the way up to the uberblock. [Z1]
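
A minimal sketch of keeping the checksum in the parent rather than next to the data is shown below. The BlockPointer class, the dictionary used as a disk, and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is used here only for convenience.
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Held by the parent: the child's address plus the checksum of its contents."""

    def __init__(self, address, data):
        self.address = address
        self.expected = checksum(data)


def verify(disk, pointer):
    # Recompute the child's checksum and compare it with the parent's copy.
    return checksum(disk[pointer.address]) == pointer.expected


disk = {0: b"leaf data"}              # hypothetical disk: address -> bytes
ptr = BlockPointer(0, disk[0])        # parent records the checksum at write time
ok_before = verify(disk, ptr)
disk[0] = b"leaf dat4"                # silent corruption of the child block
ok_after = verify(disk, ptr)
```

Since the expected value lives in a different block from the data it describes, damage to the child alone cannot also rewrite the checksum that exposes it.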

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
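
The fallback across duplicate copies might look like the sketch below. The read_block function, the list of addresses standing in for a block pointer, the CRC-32 checksum, and the step that rewrites the bad copies are all illustrative assumptions rather than ZFS behaviour taken from the sources.

```python
import zlib


def read_block(disk, addresses, expected_crc):
    """Try each copy in turn and return the first one whose checksum matches."""
    for addr in addresses:
        data = disk.get(addr)
        if data is not None and zlib.crc32(data) == expected_crc:
            # Assumed repair step: overwrite every copy with the healthy data.
            for other in addresses:
                disk[other] = data
            return data
    raise IOError("all copies failed verification")


good = b"metadata"
disk = {10: b"metXdata", 20: good}    # first copy is corrupted
result = read_block(disk, [10, 20], zlib.crc32(good))
```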

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
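
The copy-on-write update and its relationship to snapshots can be sketched with a toy store. CowStore and its fields are hypothetical, and a whole "tree" is reduced to a single dictionary to keep the example short.

```python
class CowStore:
    """Toy copy-on-write store: live data is never overwritten in place."""

    def __init__(self, root_data):
        self.blocks = {0: dict(root_data)}   # address -> block contents
        self.root = 0                        # address of the current root
        self.next_addr = 1

    def update(self, key, value):
        new_tree = dict(self.blocks[self.root])   # copy; the live block is untouched
        new_tree[key] = value
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = new_tree              # written out in a detached state
        old_root = self.root
        self.root = addr                          # one switch makes it visible
        return old_root                           # old version remains readable


store = CowStore({"a": 1})
snapshot_root = store.update("a", 2)
current = store.blocks[store.root]
previous = store.blocks[snapshot_root]
```

If a crash happened before the root switch, the store would still point at the old, consistent version. Keeping the old root around instead of freeing it is, in spirit, what makes a snapshot cheap.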

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
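
A block-level scheme built on a hash table can be sketched as follows. DedupStore is a hypothetical name, the 4-byte block size is chosen only to keep the example readable, and the fold factor is computed here as a ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # digest -> block bytes, each unique block stored once
        self.files = {}    # file name -> ordered list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reconstruction: follow the logical links back to the stored blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self):
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.write("a.txt", b"AAAABBBBAAAA")
store.write("b.txt", b"BBBBCCCC")
```

Twenty logical bytes are held in three unique 4-byte blocks here. A smaller block size would find more duplicates in real data, at the cost of more entries in the blocks table.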

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and more modest storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes each). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since each sector is small in comparison to large files, it would take significant amounts of time and memory to track every sector individually along with the file it belongs to. To avoid this, the FAT file system uses clusters, which are fixed-size groupings of sectors; each cluster belongs to at most one file. A drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster and no other file can use the unused sectors in it.

The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The directory entry for a file contains the file's name, its size and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster in the file, and so on until the last cluster entry for that file. The first file on a new device will use sequential clusters: the first cluster will point to the second, which will point to the third and so on.

The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are available. The problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters become available for new data. Because of this, some files may have their clusters scattered across the storage device, and accessing such files takes longer. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other, which decreases the time it takes to access the file. Defragmentation is not a built-in function of FAT32, and looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the reserved area at the start of every FAT32 volume contains information about the operating system and the root directory, and the volume always contains two copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
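
The linked-list behaviour of the table can be sketched in a few lines. The dictionary layout, the cluster numbers and the END_OF_CHAIN marker are placeholders chosen for readability and are not the real on-disk values.

```python
END_OF_CHAIN = -1   # placeholder marker, not the real FAT32 end-of-chain value


def cluster_chain(fat, first_cluster):
    """Follow table entries from a file's first cluster to its last."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]   # each entry names the file's next cluster
    return chain


# Hypothetical table: one sequential file and one fragmented file.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN, 5: 9, 9: 6, 6: END_OF_CHAIN}
directory = {"first.txt": 2, "scattered.txt": 5}   # name -> first cluster

sequential = cluster_chain(fat, directory["first.txt"])
fragmented = cluster_chain(fat, directory["scattered.txt"])
```

The second file's clusters (5, 9, 6) are out of order on disk, which is the fragmentation that a defragmentation tool would tidy up.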

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing what was unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system fails or shuts down suddenly, which requires the use of “fsck” (the file system checker) to traverse the inodes and directories and repair any inconsistencies.
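
The superblock fields listed above are enough to work out where an inode lives. The sketch below uses a simplified, hypothetical Superblock record with made-up example values; real ext2 has many more fields and subtracts the first data block when counting groups.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Simplified stand-in holding only the fields named in the text."""
    block_size: int
    total_blocks: int
    blocks_per_group: int
    total_inodes: int
    inodes_per_group: int


def group_count(sb):
    # Ceiling division: a partly filled last group still counts as a group.
    return -(-sb.total_blocks // sb.blocks_per_group)


def locate_inode(sb, inode_number):
    """Return (block group, index within that group) for an inode number."""
    index = inode_number - 1   # ext2 inode numbers start at 1
    return index // sb.inodes_per_group, index % sb.inodes_per_group


sb = Superblock(block_size=1024, total_blocks=20000, blocks_per_group=8192,
                total_inodes=5040, inodes_per_group=1680)
groups = group_count(sb)
first = locate_inode(sb, 1)
later = locate_inode(sb, 1681)
```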

==== Comparison ====
In general, when observing how storage devices are managed using different file systems, one can notice that the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS can address 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a single storage device, and the ext2 file system likewise manages a single storage device, whereas ZFS builds volume management into the file system through its storage pool, which can span multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.
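
The cluster-based FAT32 figures in parentheses follow from the 28 usable bits per table entry mentioned in the FAT32 section. The short check below assumes that "KB" and "TB" are meant as powers of two (2^10 and 2^40 bytes).

```python
KIB = 2**10
TIB = 2**40

CLUSTER_NUMBER_BITS = 28                  # usable bits in each 32-bit FAT entry
max_clusters = 2**CLUSTER_NUMBER_BITS


def max_volume_bytes(cluster_size):
    # Upper bound: every addressable cluster at the given cluster size.
    return max_clusters * cluster_size


limit_32k = max_volume_bytes(32 * KIB)
limit_64k = max_volume_bytes(64 * KIB)
print(limit_32k // TIB, limit_64k // TIB)
```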

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The way it implements storage is that it creates volumes which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
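
The "journal first, then write" ordering can be illustrated with a generic write-ahead sketch. This is not the NTFS log format; JournaledStore, its fields and the crash_before_apply flag are assumptions made purely to show the idea.

```python
class JournaledStore:
    """Toy write-ahead journal: intents are recorded before data is changed."""

    def __init__(self):
        self.data = {}      # the "on-disk" state
        self.journal = []   # pending (key, value) intents

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))    # 1. record the intended change
        if crash_before_apply:
            return                           # simulate an interruption here
        self.data[key] = value               # 2. apply the change
        self.journal.remove((key, value))    # 3. retire the journal entry

    def recover(self):
        # After a crash, replay whatever the journal says was still pending.
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()


fs = JournaledStore()
fs.write("a", 1)
fs.write("b", 2, crash_before_apply=True)
missing_before = "b" not in fs.data
fs.recover()
```

Recovery only has to read the journal, not scan the whole volume, which is the advantage over an fsck-style check.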

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS does have certain volume and size constraints. In particular, NTFS is a 64-bit file system which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 utilizes checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
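
The jump from 16TB to 1EB follows directly from the wider block numbers. The check below assumes a 4 KiB block size and reads "TB" and "EB" as powers of two; neither assumption is stated in the text above.

```python
BLOCK_SIZE = 4096   # assumed 4 KiB blocks
TIB = 2**40
EIB = 2**60

ext3_max = 2**32 * BLOCK_SIZE   # 32-bit block numbers
ext4_max = 2**48 * BLOCK_SIZE   # 48-bit block numbers

print(ext3_max // TIB, ext4_max // EIB)
```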

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The front-runner of the pack is currently Btrfs. Designed with similar motivations to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. Its development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks. The first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
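
The three-part key can be pictured as a single 136-bit integer, so that sorting the integers sorts by object id, then type, then the final 64-bit field. The function names and the example type numbers below are hypothetical placeholders, not Btrfs constants.

```python
def pack_key(object_id, item_type, offset):
    """Combine a 64-bit object id, 8-bit type and 64-bit field into one integer."""
    if not (0 <= object_id < 2**64 and 0 <= item_type < 2**8 and 0 <= offset < 2**64):
        raise ValueError("field out of range")
    return (object_id << 72) | (item_type << 64) | offset


def unpack_key(key):
    """Split a packed key back into (object id, type, final 64-bit field)."""
    return key >> 72, (key >> 64) & 0xFF, key & (2**64 - 1)


# Example keys; the type values are arbitrary illustrations.
keys = sorted([pack_key(257, 12, 0), pack_key(256, 84, 5), pack_key(256, 1, 0)])
ordered = [unpack_key(k) for k in keys]
widest = pack_key(2**64 - 1, 255, 2**64 - 1).bit_length()
```

After sorting, all items belonging to object 256 sit next to each other, which is why a tree ordered on this key keeps an object's related items together.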

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

At a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree, which holds information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help ensure this file system will remain useful in the future.

====WinFS====
WinFS (Windows Future Storage) is a now-defunct project; it closed its doors in 2006. It was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.

The two most notable features of WinFS were its

WinFS, while vapourware, did provide an interesting set of features, many of which are completely different from those of ZFS. One reason for this is that, while both were attempting to meet the needs of modern computing, ZFS is more focused on server settings and WinFS seemed to focus on the requirements of the home PC user.

====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with UFS (Unix File System). The ever-growing need for additional and larger disks meant that administrators were continuously switching disks, creating partitions, and copying old data to the new, larger disks, which caused a significant amount of user downtime. The administrators were therefore attracted to the storage pool concept and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed via a smartphone or a server. Time wasted on partitioning, disk swapping, and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

* (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation

COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:21:31Z

Naseido: /* BTRFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle ever-increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed with the aim of avoiding some of the major problems associated with traditional file systems: possible data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstraction, and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by ZFS, as the requirements that it satisfies are not fully implemented by any other single system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand how ZFS implements these requirements, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of the volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is unused.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage: since a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced while the remaining modules of ZFS stay intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
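The malloc()/free() analogy can be made concrete with a toy allocator. The sketch below is not the SPA interface; the class <code>StoragePool</code>, its method names, and the use of a (device name, block number) pair as a stand-in for a DVA are all assumptions chosen for illustration. It only shows the idea that callers receive opaque addresses and that devices can join the pool while it is in use.

```python
class StoragePool:
    """Toy storage pool: callers see only opaque addresses, never devices."""

    def __init__(self):
        self._free = {}  # device name -> set of free block numbers

    def add_device(self, name, block_count):
        """Grow the pool at run time by adding all blocks of a new device."""
        self._free[name] = set(range(block_count))

    def alloc(self):
        """Return the address of a free block, like malloc() for storage."""
        for name in sorted(self._free):
            if self._free[name]:
                block = min(self._free[name])
                self._free[name].remove(block)
                return (name, block)  # stand-in for a DVA
        raise MemoryError("storage pool is full")

    def free(self, address):
        """Return a block to the pool, like free() for storage."""
        name, block = address
        self._free[name].add(block)


pool = StoragePool()
pool.add_device("disk0", 1)
first = pool.alloc()          # the only block on disk0
pool.add_device("disk1", 2)   # the pool grows without any repartitioning
second = pool.alloc()         # served from the newly added device
```

The caller never chooses a device or a partition; it simply asks the pool for a block, which is the property the storage pool concept relies on.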

Virtual devices abstract device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles traditional volume manager tasks such as mirroring. It accomplishes such tasks via virtual devices, each of which implements a specific task. For example, if the SPA needed to handle mirroring, a virtual device would be written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This very large per-object capacity allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
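The read path described in the last two paragraphs can be sketched as follows. This is a minimal model, assuming a dictionary stands in for the disk and SHA-256 stands in for whichever checksum function the pool is configured to use; <code>BlockPointer</code>, <code>write_block</code> and <code>read_block</code> are hypothetical names, not ZFS functions.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class BlockPointer:
    addresses: list[int]  # one address per copy ("ditto block") of the data
    checksum: bytes       # kept in the pointer, apart from the data itself


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 is used here only because it is in the standard library.
    return hashlib.sha256(data).digest()


def write_block(disk: dict[int, bytes], addresses: list[int], data: bytes) -> BlockPointer:
    """Store one copy of the data at each address and return its pointer."""
    for address in addresses:
        disk[address] = data
    return BlockPointer(list(addresses), checksum(data))


def read_block(disk: dict[int, bytes], pointer: BlockPointer) -> bytes:
    """Return the first copy whose checksum matches, repairing bad copies."""
    bad_addresses = []
    for address in pointer.addresses:
        data = disk.get(address)
        if data is not None and checksum(data) == pointer.checksum:
            for bad in bad_addresses:
                disk[bad] = data  # overwrite the damaged copy with good data
            return data
        bad_addresses.append(address)
    raise OSError("all copies failed checksum verification")


disk = {}
pointer = write_block(disk, [10, 20], b"metadata")
disk[10] = b"garbage!"              # simulate silent corruption of one copy
recovered = read_block(disk, pointer)
```

After the read, the copy at address 10 has been rewritten from the healthy copy at address 20, which is the self-healing behaviour in miniature. If every copy is damaged, the read fails loudly instead of returning corrupt data.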

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
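The copy-on-write update and its connection to snapshots can be illustrated with an immutable tree. In this sketch, nested tuples stand in for indirect blocks and byte strings for data blocks; <code>cow_update</code> and the variable names are assumptions for illustration only. Modifying one leaf creates new nodes along the path to the root and shares everything else with the old tree.

```python
def cow_update(node, path, new_data):
    """Return a new tree with the leaf at `path` replaced; never mutate `node`."""
    if not path:
        return new_data
    index = path[0]
    new_child = cow_update(node[index], path[1:], new_data)
    # Rebuild only this node; siblings are shared with the old tree.
    return node[:index] + (new_child,) + node[index + 1:]


old_root = ((b"a", b"b"), (b"c", b"d"))
new_root = cow_update(old_root, (1, 0), b"C")

uberblock = new_root  # a single assignment models the one atomic switch
snapshot = old_root   # the previous root is still a complete, consistent tree
```

Until the final assignment, readers following the old root see the old data in full; afterwards they see the new data in full. Keeping a reference to the old root is what makes a snapshot cheap, since only the changed path was copied.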

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
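A block-level scheme of this kind can be sketched with an ordinary hash table. The example below is a simplified model, not the ZFS implementation: the class <code>DedupStore</code>, the 4-byte block size and the use of SHA-256 digests as table keys are assumptions for illustration. The fold factor is computed here as the ratio of logical to physical bytes, which is one common way to express it.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store backed by a hash table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block contents, each stored only once
        self.files = {}   # file name -> list of digests, in order

    def put(self, name, data):
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # skip blocks already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reconstruct a file by following its logical links to the blocks."""
        return b"".join(self.blocks[digest] for digest in self.files[name])

    def fold_factor(self):
        """Logical bytes stored divided by physical bytes used."""
        logical = sum(
            len(self.blocks[digest])
            for digests in self.files.values()
            for digest in digests
        )
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical if physical else 1.0


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBAAAA")
store.put("b", b"BBBBCCCC")
```

Here five logical blocks are stored as three physical ones. A smaller block size would tend to find more duplicates, at the cost of a larger table and more hashing work, which is the trade-off described above.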

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and smaller storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes each). Initially, each sector would hold the data of a file, and larger files would be stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors contain that file's data. Since a sector is small compared to many files, documenting every sector together with the file it belongs to would take significant amounts of time and memory. To avoid this, the FAT file system uses clusters, which are defined groupings of sectors; each cluster is related to one file. One drawback of clusters is that when a file is smaller than a cluster, the file still takes up the whole cluster and no other file can use the unused sectors in that cluster.

The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in the file, and this continues until the last cluster entry for that file. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.

The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. The problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered throughout the storage device, and such files take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a default function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one useful feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.
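The linked-list behaviour of the table can be shown with a short sketch. The dictionary <code>fat</code> below is a made-up table, and <code>cluster_chain</code> is a hypothetical helper; treating values of 0x0FFFFFF8 and above as end-of-chain markers reflects the usual FAT32 convention as understood here, and free or bad-cluster markers are deliberately not modelled.

```python
CLUSTER_MASK = 0x0FFFFFFF      # only the low 28 bits of each entry number a cluster
END_OF_CHAIN_MIN = 0x0FFFFFF8  # assumed convention: values from here up end a chain


def cluster_chain(fat, first_cluster):
    """Follow table entries from a file's first cluster to its last one."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster < END_OF_CHAIN_MIN:
        if cluster in seen:
            raise ValueError("corrupt FAT: cluster chain loops")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster] & CLUSTER_MASK
    return chain


# Made-up table: one file stored sequentially, one fragmented file.
fat = {
    2: 3, 3: 4, 4: 0x0FFFFFFF,  # sequential file: clusters 2 -> 3 -> 4
    5: 9, 9: 6, 6: 0x0FFFFFFF,  # fragmented file: clusters 5 -> 9 -> 6
}
sequential = cluster_chain(fat, 2)
fragmented = cluster_chain(fat, 5)
```

The directory entry supplies only the first cluster number; everything else is found by walking the table. The fragmented file shows why scattered clusters slow down access: consecutive parts of the file are no longer next to each other on the device.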

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (Unix File System); it attempts to mimic certain functionality of UFS while removing what was unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS allows 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS uses pooled storage that can span multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes that are made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS are therefore that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that NTFS allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred, and then recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system. It records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 uses checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
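The jump from 16TB to 1EB follows directly from the width of the block number. The short calculation below assumes a block size of 4096 bytes, which is not stated above but is the size at which the quoted limits work out exactly; <code>max_volume_bytes</code> is a helper written for this example.

```python
def max_volume_bytes(block_number_bits, block_size_bytes):
    """Largest volume addressable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size_bytes


BLOCK_SIZE = 4096  # assumed block size in bytes
TIB = 2 ** 40
EIB = 2 ** 60

ext3_limit = max_volume_bytes(32, BLOCK_SIZE)  # 2^32 blocks * 2^12 bytes = 2^44 bytes
ext4_limit = max_volume_bytes(48, BLOCK_SIZE)  # 2^48 blocks * 2^12 bytes = 2^60 bytes

print(ext3_limit // TIB, "TiB")  # 16 TiB
print(ext4_limit // EIB, "EiB")  # 1 EiB
```

The 16 extra bits in the block number multiply the addressable space by 2^16, that is, by 65,536.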

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is capacity. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. A single ZFS file can hold up to 16EB (2^64 bytes), which is 16 times the maximum volume size of the current ext4 Linux file system, and the 128-bit addressing of ZFS puts the limit on a whole storage pool far beyond that. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The forerunner of the pack is currently Btrfs. Designed with motivations similar to those of ZFS, Btrfs provides a similarly rich feature set that is well suited to modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, started at Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Its development started just three years after that of ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs checksums its writes and supports snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are built from three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
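
The key layout described above can be sketched as follows. This is an illustration only and not the Btrfs on-disk encoding: the field widths (64 + 8 + 64 bits) come from the text, while the names `Key`, `pack_key`, `unpack_key` and the item type numbers are hypothetical.

```python
from collections import namedtuple

# Field widths as described in the text: 64 + 8 + 64 = 136 bits.
Key = namedtuple("Key", ["objectid", "item_type", "offset"])


def pack_key(key):
    """Pack a key into one 136-bit integer, with the object id in the high bits."""
    assert 0 <= key.objectid < 2 ** 64
    assert 0 <= key.item_type < 2 ** 8
    assert 0 <= key.offset < 2 ** 64
    return (key.objectid << 72) | (key.item_type << 64) | key.offset


def unpack_key(value):
    """Split a packed 136-bit integer back into its three fields."""
    return Key(value >> 72, (value >> 64) & 0xFF, value & (2 ** 64 - 1))


# Hypothetical item type numbers, chosen only for illustration.
INODE_ITEM, DIR_ITEM = 1, 2

keys = [Key(257, DIR_ITEM, 99), Key(256, INODE_ITEM, 0), Key(257, INODE_ITEM, 0)]

# Tuple comparison orders by object id, then type, then the final 64 bits.
ordered = sorted(keys)
```

Sorting the tuples and sorting the packed integers give the same order, which is why a single composite key is enough to keep all items belonging to one object next to each other in the tree.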

The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.

At a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the roots of the other trees. Second, it contains a subvolume tree, which holds files and directories. Third, it contains an extent tree, which holds information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof this file system.

====WinFS====
WinFS (Windows Future Storage) is a now-defunct project; it closed its doors in 2006. It was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.

The two most notable features of WinFS was it's

WinFS, while vapourware, did provide an interesting set of features, many of which are completely different from those of ZFS. One reason for this is that, while both were attempting to meet the needs of modern computing, ZFS is more focused on server settings and WinFS seemed to focus on the requirements of the home PC user.

====Comparison====
Upon first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS does have three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's system administrators were having problems with their UFS (Unix File System) installation. Because of the growing need for additional and larger disks, administrators were continuously switching disks, partitioning them, and copying old data to the new larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smartphone or a server. Time wasted on partitioning, disk swapping, and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D. [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication"]. Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec. 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:04:21Z

Naseido: /* Legacy File Systems */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage is allocated and freed instead of memory. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This large per-object capacity is part of what allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS' terminology, is called the uberblock. Each block has a checksum which is maintained by the block parent's indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is able then to retrieve a backup from another
location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
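
The read path described above (verify the checksum, then fall back to a duplicate copy) can be sketched as follows. This is a toy model under stated assumptions and not ZFS code: the `BlockPointer` class and `read_block` function are hypothetical, the "disk" is a dictionary, and SHA-256 is used only because it is readily available.

```python
import hashlib


def checksum(data):
    # SHA-256 is used here for illustration only.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: addresses of all copies plus the expected checksum.

    The checksum lives in the pointer (the parent), not next to the data it covers.
    """

    def __init__(self, addresses, expected):
        self.addresses = addresses
        self.expected = expected


def read_block(disk, pointer):
    """Return the first copy whose checksum matches, repairing bad copies seen earlier."""
    bad = []
    for addr in pointer.addresses:
        data = disk[addr]
        if checksum(data) == pointer.expected:
            for bad_addr in bad:
                disk[bad_addr] = data  # overwrite the corrupted copy with the good one
            return data
        bad.append(addr)
    raise IOError("all copies failed checksum verification")


payload = b"file system metadata"
disk = {0: payload, 1: payload}           # two copies of the same block
ptr = BlockPointer([0, 1], checksum(payload))

disk[0] = b"silently corrupted!!"         # simulate silent corruption of one copy
recovered = read_block(disk, ptr)
```

After the call, `recovered` holds the original payload and the corrupted copy at address 0 has been rewritten from the healthy one.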

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
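
A minimal sketch of this copy-on-write update model is shown below. It is an in-memory illustration under simplifying assumptions, not the ZFS implementation: the `Node` and `FileSystem` classes are hypothetical, and a single Python attribute assignment stands in for the one atomic write.

```python
class Node:
    """Tree node that is never modified after creation.

    children maps a name to another Node or to a bytes payload.
    """

    def __init__(self, children):
        self.children = dict(children)


def cow_update(node, path, data):
    """Return a new node with data written at path; the old node is left untouched."""
    name, rest = path[0], path[1:]
    new_children = dict(node.children)
    if rest:
        new_children[name] = cow_update(node.children[name], rest, data)
    else:
        new_children[name] = data
    return Node(new_children)


class FileSystem:
    def __init__(self, root):
        self.root = root  # plays the role of the single top-level pointer

    def write(self, path, data):
        new_root = cow_update(self.root, path, data)  # built in a detached state
        self.root = new_root                          # the one switch-over step


old_root = Node({"home": Node({"a.txt": b"v1", "b.txt": b"keep"})})
fs = FileSystem(old_root)
fs.write(["home", "a.txt"], b"v2")
```

Because the old tree is never modified, a crash before the final assignment leaves the previous state fully intact, and keeping a reference to `old_root` gives a consistent view of the earlier version of the data.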

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
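
A hash-table-based block deduplication scheme can be sketched roughly as follows. This is a toy example under stated assumptions and not how ZFS implements deduplication: the `DedupStore` class is hypothetical, blocks are fixed-size, SHA-256 is used as the block hash, and the fold factor is computed as the ratio of logical to physical bytes.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store (fixed-size blocks, SHA-256 keys)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # hash -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of block hashes

    def put(self, name, data):
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            hashes.append(digest)
        self.files[name] = hashes

    def get(self, name):
        """Reconstruct a file by following its list of block hashes."""
        return b"".join(self.blocks[h] for h in self.files[name])

    def fold_factor(self):
        """Logical bytes provided divided by physical bytes stored (assumed definition)."""
        logical = sum(len(self.blocks[h]) for hs in self.files.values() for h in hs)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBCCCC")
store.put("b", b"AAAABBBBDDDD")
```

Here the two files share two of their three blocks, so 24 logical bytes are held in 16 physical bytes. Shrinking `block_size` tends to find more shared data but adds more entries to the hash table, which is the trade-off described above.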

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and more modest storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially these sectors were meant to hold the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since each sector is small relative to large files, it would take significant amounts of time and memory to track every sector, the file it belongs to, and where it is located. To avoid tracking so many sectors, the FAT file system uses clusters, which are fixed-size groups of sectors; each cluster belongs to at most one file. One drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in that cluster. The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster in the file, and so on until the last cluster entry for that file. The first file on a new device will use sequential clusters: the first cluster will point to the second, which will point to the third, and so on.
The number beside the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, so 2^28 clusters are available. A problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered across the storage device, which makes them slower to access. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a default function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and root directory, and the file system always keeps 2 copies of the file allocation table so that, if the file system is interrupted, the secondary FAT can be used to recover the files.
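
The linked-list lookup and the wasted-space problem described above can be sketched as follows. This is a simplified model and not the on-disk FAT32 format: the table is a Python dictionary, the `END_OF_CHAIN` marker is a hypothetical stand-in for the reserved values a real FAT uses, and the file names and cluster numbers are made up.

```python
END_OF_CHAIN = -1  # hypothetical marker; a real FAT uses reserved entry values


def cluster_chain(fat, first_cluster):
    """Follow the linked list in the table, starting from the directory's first cluster."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def wasted_bytes(file_size, cluster_size):
    """Unused space left in the last cluster of a file."""
    remainder = file_size % cluster_size
    return 0 if remainder == 0 else cluster_size - remainder


# Table index = cluster number, value = next cluster of the same file.
fat = {
    2: 3, 3: 4, 4: END_OF_CHAIN,  # first file on the device: sequential clusters
    5: 9, 9: 7, 7: END_OF_CHAIN,  # a fragmented file: clusters scattered
}

# The directory records only the first cluster of each file.
directory = {"first.txt": 2, "frag.txt": 5}

first_chain = cluster_chain(fat, directory["first.txt"])
frag_chain = cluster_chain(fat, directory["frag.txt"])
```

The fragmented file visits clusters 5, 9 and 7 in that order, which is the scattered access pattern that defragmentation tries to remove. With 4096-byte clusters, `wasted_bytes(1000, 4096)` shows that a 1000-byte file leaves 3096 bytes of its cluster unusable.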

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (Unix File System); it attempts to mimic certain functionality of UFS while removing unnecessary features. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed using different file systems, one can notice that the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike the ext2 system. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 and ext2 file systems each manage just a single storage device, whereas ZFS includes its own volume management, which can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also a part. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
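
The journaling idea described above (record the change first, apply it second) can be sketched as follows. This is a deliberately simplified illustration and not the NTFS log format: the `JournaledStore` class is hypothetical, both the journal and the "disk" are in-memory Python structures, and replaying an entry is assumed to be safe to repeat.

```python
class JournaledStore:
    """Toy write-ahead journal: every change is logged before it is applied."""

    def __init__(self):
        self.data = {}     # stands in for the on-disk file system state
        self.journal = []  # changes that were logged but not yet retired

    def begin(self, changes):
        self.journal.append(dict(changes))  # step 1: record the intended change

    def apply_next(self):
        entry = self.journal[0]
        self.data.update(entry)             # step 2: make the change
        self.journal.pop(0)                 # step 3: retire the journal entry

    def recover(self):
        """After an interruption, replay whatever is still in the journal."""
        while self.journal:
            self.apply_next()


store = JournaledStore()
store.begin({"a.txt": "hello"})
# Simulated interruption here: the change was logged but never applied.
store.recover()
```

Because only the journal has to be examined on recovery, the whole volume does not need to be scanned to bring the file system back to a consistent state.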

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS does have certain volume and size constraints. In particular, NTFS is a 64-bit file system which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, which are descriptors representing ranges of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case writing to the disk is interrupted. To ensure data integrity, ext4 utilizes checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing data when it is moved. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The forerunner of the pack is currently Btrfs. Designed with similar motivations to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. With development starting just three years after ZFS, its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs was designed for efficiency, integrity and maintainability. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries of these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.
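
The 136-bit key layout described above can be sketched in a few lines of Python. This is only an illustration of the 64 + 8 + 64 bit split: the names <code>Key</code>, <code>pack_key</code> and <code>unpack_key</code> are hypothetical, the last field is called <code>offset</code> purely for convenience, and the type values used below are made up rather than taken from Btrfs.

```python
from typing import NamedTuple

OBJECTID_BITS = 64
TYPE_BITS = 8
OFFSET_BITS = 64


class Key(NamedTuple):
    """Illustrative 136-bit key: object id, item type, type-dependent field."""
    objectid: int
    item_type: int
    offset: int


def pack_key(key: Key) -> int:
    """Pack the three fields into one integer, most significant field first."""
    if not 0 <= key.objectid < 2 ** OBJECTID_BITS:
        raise ValueError("objectid does not fit in 64 bits")
    if not 0 <= key.item_type < 2 ** TYPE_BITS:
        raise ValueError("item_type does not fit in 8 bits")
    if not 0 <= key.offset < 2 ** OFFSET_BITS:
        raise ValueError("offset does not fit in 64 bits")
    return (
        (key.objectid << (TYPE_BITS + OFFSET_BITS))
        | (key.item_type << OFFSET_BITS)
        | key.offset
    )


def unpack_key(value: int) -> Key:
    """Recover the three fields from a packed 136-bit integer."""
    offset = value & (2 ** OFFSET_BITS - 1)
    item_type = (value >> OFFSET_BITS) & (2 ** TYPE_BITS - 1)
    objectid = value >> (TYPE_BITS + OFFSET_BITS)
    return Key(objectid, item_type, offset)
```

Because the object id occupies the most significant bits, sorting by the packed integer gives the same order as sorting by the tuple (object id, type, last field), so all items belonging to one object end up next to each other in the tree.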

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the children of the node. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of roots, which contains the roots of all the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all the allocated extents. Outside of the trees are the extents and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. Instead, a transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data and writes the new data there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof this file system.

====WinFS====
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) installation. Given the ever-growing need for additional and larger disks, administrators were continuously switching disks, creating partitions, and copying old data to the new, larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether it is accessed via a smartphone or a server. Time wasted on partitioning, disk swapping, and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it hasn't already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T08:03:22Z

Naseido: /* ZFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
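
The malloc()/free() analogy can be made concrete with a toy allocator. The sketch below is not the SPA interface: <code>StoragePool</code>, <code>add_device</code>, <code>alloc</code> and <code>free</code> are hypothetical names, and a DVA is modelled as an opaque integer that the pool maps to a (device, block) pair. It is only meant to show that callers never see which device backs a block, which is why storage can be added while the pool is in use.

```python
class StoragePool:
    """Toy pooled-storage allocator; callers only ever see opaque DVAs."""

    def __init__(self):
        self._free = []        # (device_id, block_number) pairs not in use
        self._allocated = {}   # dva -> (device_id, block_number)
        self._next_dva = 0

    def add_device(self, device_id, num_blocks):
        """Grow the pool dynamically by contributing a device's blocks."""
        self._free.extend((device_id, b) for b in range(num_blocks))

    def alloc(self):
        """Allocate one block and return its virtual address (like malloc)."""
        if not self._free:
            raise MemoryError("storage pool exhausted")
        location = self._free.pop(0)
        dva = self._next_dva
        self._next_dva += 1
        self._allocated[dva] = location
        return dva

    def free(self, dva):
        """Return a block to the pool (like free)."""
        self._free.append(self._allocated.pop(dva))

    def free_blocks(self):
        return len(self._free)
```

In this sketch, a pool that has run out of space starts serving allocations again as soon as <code>add_device</code> is called, without any change on the caller's side.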

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It is easier to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum, which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
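
The interaction between checksums and ditto blocks can be sketched as follows. This is a simplified model, not ZFS code: the disk is a plain dictionary from address to bytes, <code>BlockPointer</code>, <code>write_block</code> and <code>read_block</code> are hypothetical names, and SHA-256 simply stands in for whichever checksum function is configured. The checksum is kept in the pointer, separate from the data it protects.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Points at one or more copies of a block and stores their checksum."""

    def __init__(self, addresses, expected_checksum):
        self.addresses = list(addresses)
        self.checksum = expected_checksum


def write_block(disk, addresses, data):
    """Write the same data to every address and return a pointer to the copies."""
    for address in addresses:
        disk[address] = data
    return BlockPointer(addresses, checksum(data))


def read_block(disk, pointer):
    """Return the first copy whose checksum verifies, repairing bad copies."""
    good = None
    for address in pointer.addresses:
        if checksum(disk[address]) == pointer.checksum:
            good = disk[address]
            break
    if good is None:
        raise OSError("every copy failed checksum verification")
    for address in pointer.addresses:
        if checksum(disk[address]) != pointer.checksum:
            disk[address] = good  # heal the damaged copy from the healthy one
    return good
```

If one copy is silently corrupted, a read still returns the correct data and rewrites the damaged copy; only when every copy fails verification does the read report an error.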

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
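
A minimal sketch of this update model is shown below, under simplifying assumptions: blocks live in an append-only dictionary, interior blocks are tuples of child addresses, leaves are bytes, and the single <code>root</code> attribute plays the role of the pointer that is switched atomically. The class and method names (<code>CowStore</code>, <code>write_block</code>, <code>update</code>, <code>read</code>) are hypothetical.

```python
class CowStore:
    """Toy copy-on-write block tree: existing blocks are never overwritten."""

    def __init__(self):
        self.blocks = {}    # address -> content (bytes leaf or tuple of children)
        self.next_address = 0
        self.root = None    # address of the current tree root

    def write_block(self, content):
        """Store content at a fresh address and return that address."""
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = content
        return address

    def _copy_path(self, address, path, data):
        """Write new copies of every block on the path; return the new top address."""
        if not path:
            return self.write_block(data)
        children = list(self.blocks[address])
        children[path[0]] = self._copy_path(children[path[0]], path[1:], data)
        return self.write_block(tuple(children))

    def update(self, path, data):
        """Build the new structure in a detached state, then switch the root."""
        new_root = self._copy_path(self.root, path, data)
        self.root = new_root  # the single step that makes the change visible

    def read(self, path, root=None):
        """Follow child indexes from a root (the current one by default)."""
        address = self.root if root is None else root
        for index in path:
            address = self.blocks[address][index]
        return self.blocks[address]
```

Until the final assignment to <code>root</code>, the old tree is complete and untouched, so an interruption at any earlier point leaves a consistent structure. Keeping an old root address around is also all that is needed to read the tree as it was before the update, which is the basis of snapshots.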

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
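
A block-level scheme of this kind can be sketched with a dictionary acting as the hash table. The names (<code>DedupStore</code>, <code>write</code>, <code>read</code>, <code>fold_factor</code>) are hypothetical, SHA-256 is an arbitrary choice of hash, and the fold factor is computed here as the ratio of logical to physical bytes, which is one reasonable reading of the term rather than a definition taken from the sources.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store backed by a hash table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # digest -> block data, stored physically once
        self.files = {}    # file name -> list of digests (logical layout)

    def write(self, name, data):
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # keep only the first copy
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        """Reconstruct a file by following its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self):
        """Logical bytes referenced by files divided by physical bytes stored."""
        logical = sum(
            len(self.blocks[d]) for digests in self.files.values() for d in digests
        )
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical
```

With a 4-byte block size, writing <code>b"AAAABBBB"</code> and <code>b"AAAACCCC"</code> stores three unique blocks for four logical ones. Shrinking the block size tends to find more duplicates but also adds one hash table entry per block, which is the trade-off described above.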

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. In order to retrieve a file, the file system must record which sectors contain the data of the requested file. Since each sector is small relative to large files, it would take significant amounts of time and memory to record, for every sector, the file it is associated with and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors, and each cluster belongs to a single file. One issue with clusters is that when a file is smaller than a cluster, the file still occupies the whole cluster and no other file can use the unused sectors in that cluster. In FAT32, FAT stands for File Allocation Table, which is the table that contains an entry for each cluster in the storage device along with its properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in that file, and so on until the last cluster entry for that file. The first file on a new device will use sequential clusters: the first cluster will point to the second, which will point to the third, and so on.
The number beside the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. One issue with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters are released and become available for new data. Because of this, some files may have their clusters scattered across the storage device, which makes them slower to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other, which decreases the time it takes to access a file. Since this is not a default function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
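
The linked-list behaviour of the table, and the linear search for free space, can be sketched as below. The table is modelled as a Python list indexed by cluster number. The marker values and the convention that data clusters start at 2 are illustrative stand-ins for the on-disk encoding, and the function names are hypothetical.

```python
FREE = 0x00000000           # illustrative marker for an unused cluster
END_OF_CHAIN = 0x0FFFFFFF   # illustrative marker for a file's last cluster
FIRST_DATA_CLUSTER = 2      # entries 0 and 1 are treated as reserved here


def cluster_chain(fat, first_cluster):
    """Follow the table from a file's first cluster to its last one."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if not FIRST_DATA_CLUSTER <= cluster < len(fat) or cluster in chain:
            raise ValueError("corrupt cluster chain")
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def find_free_cluster(fat):
    """Linear search for the first unused cluster, or None if the table is full."""
    for number in range(FIRST_DATA_CLUSTER, len(fat)):
        if fat[number] == FREE:
            return number
    return None


# The directory entry of file A says it starts at cluster 2, file B at cluster 4.
# File B is fragmented: its second cluster is 7, with free cluster 6 in between.
example_fat = [None, None, 3, 5, 7, END_OF_CHAIN, FREE, END_OF_CHAIN]
file_a_clusters = cluster_chain(example_fat, 2)
file_b_clusters = cluster_chain(example_fat, 4)
```

Reading a file costs one table lookup per cluster, and the clusters returned for a fragmented file are not adjacent, which is the access-time penalty that defragmentation tries to remove.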

==== Ext2 ====
The ext2 file system (second extended file system) was modeled on UFS (Unix File System); it attempts to mimic certain functionality of UFS while removing what is unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when comparing how storage devices are managed by different file systems, one can see that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 and ext2 file systems each manage only a single storage device, whereas ZFS incorporates a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance because, in order to move compressed files, they must first be decompressed, then transferred, and then recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, each of which is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files than ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case writing to the disk is interrupted. In order to ensure data integrity, ext4 utilizes checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
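
The two volume limits quoted above follow directly from the width of the block number, as the short calculation below shows. It assumes a 4 KiB block size, which is not stated in the text, and it treats the TB and EB figures as binary units (TiB and EiB).

```python
KIB = 2 ** 10
TIB = 2 ** 40
EIB = 2 ** 60

BLOCK_SIZE = 4 * KIB  # assumed block size; other block sizes give other limits


def max_volume_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest volume addressable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size


ext3_limit = max_volume_bytes(32)   # 2^32 blocks * 2^12 bytes = 2^44 bytes
ext4_limit = max_volume_bytes(48)   # 2^48 blocks * 2^12 bytes = 2^60 bytes

assert ext3_limit == 16 * TIB
assert ext4_limit == 1 * EIB
```

The extra 16 bits of block number multiply the addressable space by 2^16, which is exactly the ratio between 1 EiB and 16 TiB.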

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Btrfs began development just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.

Btrfs was designed with efficiency, integrity and maintainability in mind. Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.
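The key layout described above can be illustrated with a minimal Python sketch. The class name, field names and little-endian byte order are assumptions made for illustration and are not taken from the Btrfs sources. The sketch only shows that 64 + 8 + 64 bits pack into 17 bytes (136 bits) and that keys sort by object id first, then type, then the final 64-bit field.

```python
import struct
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative 136-bit key: 64-bit object id, 8-bit type, 64-bit offset."""
    objectid: int   # 64-bit object id
    item_type: int  # 8 bits denoting the item's type
    offset: int     # final 64 bits; meaning depends on the item type

    def pack(self) -> bytes:
        # "<QBQ" = unsigned 64-bit, unsigned 8-bit, unsigned 64-bit, no padding.
        # Little-endian is an assumption made for this sketch.
        return struct.pack("<QBQ", self.objectid, self.item_type, self.offset)


# Tuples compare field by field, so sorting orders by objectid, then type, then offset.
keys = [Key(257, 12, 0), Key(256, 84, 5), Key(256, 1, 0)]
sorted_keys = sorted(keys)

print(len(Key(256, 1, 0).pack()) * 8, "bits")
print(sorted_keys)
```

Because all items for one object id sort next to each other, everything belonging to a single object ends up adjacent in the tree.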

The trees are constructed of these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the children of the node. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of roots, which holds the roots of all other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the tree of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on the same blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. This copy-on-write method, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof this file system.

====WinFS====
====Comparison====
Upon first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS does have three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's system administrators were having problems with their UFS (Unix File System) setup. The ever-growing need for additional and larger disks meant that administrators were continuously switching disks, creating partitions, and copying old data to the new larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, demand is increasing not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smart-phone or a server. Time wasted on partitioning, disk swapping and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr. William F. Heybruck. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential - A Brief and Incomplete History of FAT32. (2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

* Microsoft TechNet. (March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

* Microsoft TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

* Microsoft TechNet. (Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:56:06Z

Naseido: /* Conclusion */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks belonging to each object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
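The checksum-then-fallback read path described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions and not ZFS code: the "disk" is a plain dictionary, the class and function names are hypothetical, and SHA-256 is used only because it is available in the standard library.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is an assumption for this sketch, chosen for convenience.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Stores the expected checksum apart from the data, plus every copy's address."""

    def __init__(self, addresses, expected):
        self.addresses = addresses
        self.expected = expected


def write_block(disk: dict, addresses, data: bytes) -> BlockPointer:
    # Write the same data to each address, producing the duplicate ("ditto") copies.
    for addr in addresses:
        disk[addr] = data
    return BlockPointer(list(addresses), checksum(data))


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    # Try each copy in turn and return the first one whose checksum matches.
    for addr in ptr.addresses:
        data = disk.get(addr)
        if data is not None and checksum(data) == ptr.expected:
            return data
    raise IOError("all copies failed checksum verification")


disk = {}
ptr = write_block(disk, [10, 20], b"metadata")
disk[10] = b"metadatX"        # simulate silent corruption of the first copy
print(read_block(disk, ptr))  # the healthy second copy is returned
```

A fuller model would also rewrite the damaged copy from the healthy one after a successful fallback read, which is the "healing" step; it is omitted here to keep the sketch short.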

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
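A minimal Python sketch of this copy-on-write idea follows. It models the on-disk structures as an in-memory tree, which is an assumption made purely for illustration; the `Node` class and `cow_write` function are hypothetical names. A write never modifies existing nodes: it builds new copies of the nodes along the path to the changed leaf and returns a new root, so switching to the new version amounts to replacing a single root reference.

```python
class Node:
    """A tree node holding either child nodes or leaf data."""

    def __init__(self, children=None, data=None):
        self.children = children or {}
        self.data = data


def cow_write(root, path, data):
    """Return a new root; nodes on the path are copied, all others are shared."""
    if not path:
        return Node(data=data)
    new_children = dict(root.children)          # copy this node's pointer table
    child = root.children.get(path[0], Node())  # existing child, or a fresh one
    new_children[path[0]] = cow_write(child, path[1:], data)
    return Node(children=new_children, data=root.data)


old_root = cow_write(Node(), ["a", "x"], b"1")
old_root = cow_write(old_root, ["b", "y"], b"2")

# The update builds detached structures; old_root is untouched until we switch.
new_root = cow_write(old_root, ["a", "x"], b"3")

print(old_root.children["a"].children["x"].data)        # old version still readable
print(new_root.children["a"].children["x"].data)        # new version
print(new_root.children["b"] is old_root.children["b"]) # unchanged subtree is shared
```

Keeping a reference to `old_root` around is, in this toy model, all that a snapshot requires, which hints at why snapshots are cheap in a copy-on-write design.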

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
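The hash-table approach can be illustrated with a small Python sketch of block-level deduplication. This is a toy model, not how ZFS implements it: the class name, the tiny 4-byte block size and the use of SHA-256 are all assumptions chosen for readability, and the fold factor is computed here as the ratio of logical to physical bytes.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store backed by a hash table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block data, each unique block stored once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data: bytes):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def get(self, name) -> bytes:
        # Reconstruct the file by following its list of block references.
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.put("a.txt", b"AAAABBBB")
store.put("b.txt", b"AAAACCCC")  # shares the block "AAAA" with a.txt

print(len(store.blocks), "unique blocks stored")
print(store.get("b.txt"))
print(store.fold_factor())
```

Shrinking `block_size` tends to find more shared blocks but adds more entries to the `blocks` table, which is the granularity trade-off described above.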

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and smaller storage needs, than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
The storage space of a device is made up of sectors (usually 512 bytes each). Initially the plan was that these sectors would contain the data of a file, and that larger files would be stored across multiple sectors. In order to retrieve a file, the file system must record which sectors contain the data of the requested file. Since each sector is small in comparison to large files, it would take significant amounts of time and memory to document, for every sector, which file it belongs to and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are defined groupings of sectors. Each cluster is associated with one file. One issue with clusters is that when a file is smaller than a cluster, the file still occupies the whole cluster and no other file can use the unused sectors in that cluster. The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in that file. This continues until the last cluster entry for that file. The first file on a new device will use sequential clusters. Therefore, the first cluster will point to the second, which will point to the third and so on.
The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. One issue with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all of the clusters that together make up the file. This process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered throughout the storage device, and such files take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool for users to use. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other. This decreases the time it takes to access a file from the file system. Since this is not a default function in the FAT32 system, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains 2 copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to be used to recover the files.
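The linked-list behaviour of the table, and the space wasted by small files, can be sketched in Python as follows. The table is modelled as a dictionary from cluster number to next cluster number, which is a simplification of the on-disk array. The function names are illustrative, and the sketch assumes the FAT32 convention that entry values of 0x0FFFFFF8 and above mark the end of a chain.

```python
ENTRY_MASK = 0x0FFFFFFF  # only the low 28 bits of each 32-bit entry are used
EOC_MIN = 0x0FFFFFF8     # assumed: values at or above this mark end of chain


def cluster_chain(fat, first_cluster):
    """Follow the linked list in the FAT and return a file's clusters in order."""
    chain = []
    cluster = first_cluster
    while cluster < EOC_MIN:
        if len(chain) > len(fat):
            raise ValueError("cycle detected in FAT chain")
        chain.append(cluster)
        cluster = fat[cluster] & ENTRY_MASK
    return chain


def slack_bytes(file_size, cluster_size):
    """Bytes wasted because the last cluster of a file is only partly used."""
    clusters_used = -(-file_size // cluster_size)  # ceiling division
    return clusters_used * cluster_size - file_size


# A fragmented file: clusters 2, 3, 4, then a jump to 9, then end of chain.
fat = {2: 3, 3: 4, 4: 9, 9: 0x0FFFFFFF}

print(cluster_chain(fat, 2))
print(slack_bytes(100, 32 * 1024), "bytes wasted for a 100-byte file")
```

The jump from cluster 4 to cluster 9 is what fragmentation looks like in the table: reading the file requires visiting non-adjacent regions of the disk.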

==== Ext2 ====
The ext2 file system (second extended file system) was designed after UFS (Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table was used to define how file fragments were organized, and it was important to have duplicate copies of this FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy gets corrupted. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed using different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a single storage device, and the ext2 file system likewise manages a single storage device, whereas ZFS incorporates the role of a volume manager and can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system. This assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred, and then recompressed. Another disadvantage is that NTFS has certain volume and file size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS, but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system. It records changes in a journal before applying them, so that the file system can recover if a write to disk is interrupted. In order to ensure data integrity, ext4 utilizes checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems can do. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS. The forerunner of the pack is currently Btrfs. Designed with similar motivations to ZFS, Btrfs provides a similarly rich feature set that is ideally suited for modern computing needs.

====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. With development starting just three years after ZFS, the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed for efficiency, integrity and maintainability. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The data stored in these trees is referred to as items, and all items are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
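
The key layout described above can be sketched as a small Python structure. This is an illustration only: the field names, the little-endian packing, and the sample type values are assumptions made for the example, not taken from the Btrfs sources.

```python
import struct
from typing import NamedTuple

class Key(NamedTuple):
    """A 136-bit key: 64-bit object id, 8-bit item type, 64-bit offset."""
    objectid: int   # 64 bits
    item_type: int  # 8 bits
    offset: int     # 64 bits, meaning depends on item_type

    def pack(self) -> bytes:
        # "<QBQ" = unsigned 64-bit, unsigned 8-bit, unsigned 64-bit, no padding
        return struct.pack("<QBQ", self.objectid, self.item_type, self.offset)

# Tuples compare field by field, so sorting orders keys by object id first,
# then item type, then offset.
keys = [Key(257, 108, 4096), Key(256, 1, 0), Key(257, 1, 0), Key(257, 108, 0)]
keys.sort()

print(len(keys[0].pack()) * 8)  # 136
print(keys[0])                  # Key(objectid=256, item_type=1, offset=0)
```

Because the object id is the most significant part of the key, all items belonging to one object end up adjacent in the sorted order.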

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the children of the node. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of tree roots, which contains the roots of the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains the information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data and writes the new data there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof this file system.

====WinFS====
====Comparison====
Upon first inspection Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System). Given the growing need for additional and larger disks, and the problems with administrators continuously switching disks, creating partitions, and copying old data to the new larger disks, there was a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.

Lastly, given the increasing demands not only for vast amounts of storage but also for the continuous availability of that storage, whether accessed via a smartphone or a server, the time wasted on partitioning, disk swapping and similar activities, along with any other inefficiency in traditional file systems, will soon become intolerable, if it hasn't already.

Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:45:52Z

Naseido: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(). However, instead of memory, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
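
The malloc()/free() analogy can be made concrete with a toy allocator. The sketch below is not the SPA interface: the class name, method names, and the (device, block) tuple standing in for a DVA are all hypothetical, chosen only to show how callers can allocate from a pool without knowing which device backs a block, and how a device can be added while the pool is in use.

```python
from collections import deque

class StoragePool:
    """Toy pooled-storage allocator (hypothetical, not the ZFS SPA API)."""

    def __init__(self):
        self.free_blocks = deque()  # available (device, block) addresses
        self.used_blocks = set()    # addresses currently handed out

    def add_device(self, name, n_blocks):
        # Growing the pool only adds more free addresses; callers are unaffected.
        self.free_blocks.extend((name, b) for b in range(n_blocks))

    def alloc(self):
        if not self.free_blocks:
            raise RuntimeError("pool exhausted")
        address = self.free_blocks.popleft()
        self.used_blocks.add(address)
        return address

    def release(self, address):
        self.used_blocks.remove(address)
        self.free_blocks.append(address)

pool = StoragePool()
pool.add_device("disk0", 1)
first = pool.alloc()          # ("disk0", 0)
pool.add_device("disk1", 2)   # storage added dynamically
second = pool.alloc()         # ("disk1", 0)
pool.release(first)
print(first, second, len(pool.free_blocks))  # ('disk0', 0) ('disk1', 0) 2
```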

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks", provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
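
The read path described above (verify the checksum, then fall back to a duplicate copy) can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the copies are plain byte strings rather than on-disk blocks, and SHA-256 from Python's standard library is used purely for convenience.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def read_block(copies, expected_checksum):
    """Return the first copy whose checksum matches the stored checksum."""
    for data in copies:
        if checksum(data) == expected_checksum:
            return data
    raise IOError("all copies failed checksum verification")

good = b"file system metadata"
stored = checksum(good)                    # kept apart from the data itself
copies = [b"file system metadatA", good]   # first copy is silently corrupted

print(read_block(copies, stored))  # b'file system metadata'
```

A fuller version would also rewrite the damaged copy from the healthy one, which is the "healing" step.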

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
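
The copy-on-write update and the snapshots it enables can be illustrated with a tiny in-memory tree. This is a conceptual sketch only: the `Node` class and `cow_update` function are hypothetical, and real ZFS operates on disk blocks and block pointers rather than Python objects.

```python
class Node:
    def __init__(self, data=None, children=None):
        self.data = data
        self.children = list(children or [])

def cow_update(root, path, new_data):
    """Return a new root with the node at `path` replaced; nothing is mutated."""
    if not path:
        return Node(new_data, root.children)
    index = path[0]
    new_children = list(root.children)
    new_children[index] = cow_update(root.children[index], path[1:], new_data)
    return Node(root.data, new_children)

old_root = Node(children=[Node(b"a"), Node(b"b")])
new_root = cow_update(old_root, [1], b"B")

print(old_root.children[1].data)  # b'b'  (the old tree still reads as before)
print(new_root.children[1].data)  # b'B'
print(new_root.children[0] is old_root.children[0])  # True (untouched data is shared)
```

Switching which root is considered current is the single atomic step; keeping a reference to `old_root` is, in effect, a snapshot.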

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
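
A minimal block-level deduplication store built on a hash table might look like the sketch below. The class and method names are hypothetical, the 4-byte block size is chosen only to keep the example small, and the fold factor is computed here as a ratio of logical to physical bytes, which is one assumed reading of the definition above.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # skip blocks already stored
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self):
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical

store = DedupStore()
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")
print(len(store.blocks))    # 4 unique blocks stored for 6 logical blocks
print(store.fold_factor())  # 1.5
```

Shrinking `block_size` tends to find more duplicates but grows the `blocks` table, which is the trade-off the paragraph describes.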

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially it was planned that these sectors would contain the data of a file, and that larger files would be stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors contain the data of the requested file. Since each sector is small in comparison to larger files, it would take significant amounts of time and memory to record, for each sector, the file it is associated with and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are defined groupings of sectors; each cluster is related to one file. One issue with clusters is that when a file is smaller than a cluster, the file still takes up the whole cluster and no other file can use the unused sectors in that cluster. In FAT32, the name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in that file. This continues until the last cluster entry for that file. The first file on a new device will use all sequential clusters. Therefore, the first cluster will point to the second, which will point to the third and so on.
The digits beside the name of a FAT system, as in FAT32, mean that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. An issue that arises from having larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that together make up the file. This process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered across the storage device, and such files take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other, which decreases the time it takes to access a file. Since this is not a default function in the FAT32 system, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and always contains two copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to recover the files.
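
The linked-list behaviour of the file allocation table can be sketched in a few lines. The table below is a hypothetical in-memory stand-in (a dictionary from cluster number to next cluster number), and the end-of-chain marker is an assumed value chosen for the example.

```python
END_OF_CHAIN = 0x0FFFFFFF  # assumed end-of-chain marker for this sketch
CLUSTER_MASK = 0x0FFFFFFF  # only the low 28 bits of an entry number a cluster

def cluster_chain(fat, first_cluster):
    """Follow FAT entries from a file's first cluster to the end of its chain."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster] & CLUSTER_MASK
    return chain

# One contiguous file starting at cluster 2, one fragmented file starting at 5.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN, 5: 9, 9: 6, 6: END_OF_CHAIN}

print(cluster_chain(fat, 2))  # [2, 3, 4]
print(cluster_chain(fat, 5))  # [5, 9, 6]
```

The second chain shows why fragmentation hurts: the clusters must be visited in table order (5, 9, 6) even though they are not adjacent on the device.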

==== Ext2 ====
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing unnecessary features. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. FAT32 and ext2 each manage only a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.
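
The FAT32 figures for 32KB and 64KB clusters can be checked from the 28-bit cluster numbering described earlier. The sketch below is only that arithmetic; the function name is illustrative, and binary units (KiB, TiB) are assumed.

```python
CLUSTER_NUMBER_BITS = 28  # bits of each FAT32 entry used to number clusters
KIB = 2 ** 10
TIB = 2 ** 40

def fat32_max_bytes(cluster_size_bytes):
    """Upper bound on volume size: number of clusters times cluster size."""
    return (2 ** CLUSTER_NUMBER_BITS) * cluster_size_bytes

for cluster_kib in (32, 64):
    print(cluster_kib, fat32_max_bytes(cluster_kib * KIB) // TIB)
# 32 8
# 64 16
```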

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. A further disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, which are descriptors representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case writing to the disk is interrupted. In order to ensure data integrity, ext4 utilizes checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance as there is no need for downtime to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. With development starting just three years after ZFS, the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed with efficiency, integrity and maintainability in mind. Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs provides checksums for writes and snapshots. Maintainability is supported through efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The data stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object ID, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree ID and chunk ID, as well as a couple of others.
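
The 64 + 8 + 64 bit key layout described above can be sketched in a few lines. This is only an illustration: the field names, the little-endian byte order and the sample type numbers are assumptions, not taken from the Btrfs source.

```python
import struct
from collections import namedtuple

# Field names are illustrative; the layout follows the 64 + 8 + 64 bit
# split described in the text (17 bytes = 136 bits).
Key = namedtuple("Key", ["objectid", "item_type", "offset"])
KEY_FORMAT = "<QBQ"  # assumed little-endian, no padding: u64, u8, u64


def pack_key(key: Key) -> bytes:
    """Serialize a key into its 17-byte form."""
    return struct.pack(KEY_FORMAT, *key)


def unpack_key(raw: bytes) -> Key:
    """Rebuild a key from its 17-byte form."""
    return Key(*struct.unpack(KEY_FORMAT, raw))


# Arbitrary sample keys; tuples sort by objectid, then type, then offset,
# which is the ordering a tree sorted on the whole key would use.
keys = [Key(257, 12, 0), Key(256, 84, 5), Key(256, 1, 0)]
ordered = sorted(keys)
print(struct.calcsize(KEY_FORMAT) * 8, "bits per key")
print(ordered)
```

Sorting on the whole key keeps all items belonging to one object adjacent in a leaf, grouped by item type.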

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the node's children. Leaves contain multiple items and their data.
At a minimum, a Btrfs file system contains three trees: a tree of tree roots, which contains the roots of the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all of the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on existing blocks. Instead, a transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made in another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof the file system.

====WinFS====
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs does not have the self-healing capability or the data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) installation. Given the growing need for additional and larger disks, the administrators kept having to switch disks, create partitions, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users.

The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].

*Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].

* Mathur, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:43:07Z

Naseido: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum, which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
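
The read path described above (verify the checksum, fall back to another copy, repair the damaged one) can be sketched as follows. This is a toy model, not ZFS code: the `BlockPointer` class, the dictionary standing in for a disk, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is used here for illustration only.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: one checksum, several copy locations."""

    def __init__(self, addresses, expected):
        self.addresses = addresses  # where each duplicate copy lives
        self.expected = expected    # checksum stored apart from the data


def read_block(disk, ptr):
    """Return the first copy whose checksum matches, repairing bad copies."""
    good = None
    bad = []
    for addr in ptr.addresses:
        data = disk[addr]
        if checksum(data) == ptr.expected:
            good = data
            break
        bad.append(addr)
    if good is None:
        raise IOError("all copies failed checksum verification")
    for addr in bad:
        disk[addr] = good  # overwrite the damaged copy with the healthy one
    return good


payload = b"file system metadata"
disk = {10: payload, 20: payload}          # two copies of the same block
ptr = BlockPointer([10, 20], checksum(payload))
disk[10] = b"file system metadatA"         # simulate silent corruption
recovered = read_block(disk, ptr)
```

Because the checksum lives in the pointer rather than next to the data, a single corrupted location cannot damage both the block and the value used to verify it.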

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
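
The detached-write-then-atomic-switch idea can be illustrated with a small in-memory tree. This is a sketch under simplifying assumptions: the `Node` class and helper functions are hypothetical, nodes are treated as immutable by convention, and a single variable assignment stands in for the one atomic write to disk.

```python
class Node:
    """Tree node that is never edited after creation (by convention)."""

    __slots__ = ("children", "data")

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_update(root, path, data):
    """Return a new root in which the leaf at `path` holds `data`.

    Only the nodes along `path` are copied; everything else is shared
    with the old tree, which stays intact and consistent.
    """
    if not path:
        return Node(root.children if root else None, data)
    children = dict(root.children) if root else {}
    children[path[0]] = cow_update(children.get(path[0]), path[1:], data)
    return Node(children, root.data if root else None)


def lookup(root, path):
    """Follow `path` from `root` and return the data stored at the leaf."""
    node = root
    for key in path:
        node = node.children[key]
    return node.data


old_root = cow_update(None, ["home", "notes.txt"], b"v1")
old_root = cow_update(old_root, ["home", "todo.txt"], b"todo")

# Build the new structures in a detached state...
new_root = cow_update(old_root, ["home", "notes.txt"], b"v2")
# ...then publish them with a single pointer switch.
snapshot = old_root
live_root = new_root
```

Keeping a reference to the old root is all that is needed to retain the previous state, which is why snapshots are inexpensive in a copy-on-write design.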

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
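
A minimal block-level version of this scheme is sketched below. The `DedupStore` class is hypothetical, the 4-byte block size is chosen only to keep the example readable, and the fold factor is computed here as the ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps each unique block only once."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # hash -> block contents (the hash table)
        self.files = {}   # file name -> ordered list of block hashes

    def write(self, name, data: bytes):
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if new
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name) -> bytes:
        # Reconstruction: follow the logical links back to the blocks.
        return b"".join(self.blocks[h] for h in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[h])
                      for hashes in self.files.values() for h in hashes)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.write("a", b"AAAABBBBAAAA")
store.write("b", b"BBBBCCCC")
print(len(store.blocks), "unique blocks stored")
print(round(store.fold_factor(), 2))
```

Shrinking `block_size` tends to find more duplicates but grows the `blocks` table and the number of hashes computed per write, which is the trade-off described above.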

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and more modest storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage space is divided into sectors (usually 512 bytes). Initially, sectors were intended to hold the data of a file directly, with larger files stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors contain that file's data. Since each sector is small relative to large files, it would take significant time and memory to record, for every sector, which file it belongs to and where it is located. To avoid tracking so many sectors, the FAT file system uses clusters, which are fixed-size groups of sectors; each cluster belongs to at most one file. One drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in it. In FAT32, FAT stands for File Allocation Table, which is the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked-list data structure in which each node holds one cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the file's second cluster, and so on until the entry for the file's last cluster. The first file on a new device will use sequential clusters: the first cluster points to the second, which points to the third, and so on.
The number in the name of a FAT variant, as in FAT32, indicates that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. One issue with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all of the clusters that make up the file, and this takes longer if the clusters are not stored near each other. When files are deleted, their clusters are released, leaving empty clusters available for new data. As a result, some files may have their clusters scattered across the storage device, which makes them slower to access. FAT32 itself does not include defragmentation, but recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a built-in function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps two copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.
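
The linked-list behaviour of the file allocation table, and the space wasted in a partly filled cluster, can be sketched as follows. This is a simplified model: the dictionary-based table, the `END_OF_CHAIN` marker and the function names are hypothetical and do not reflect the on-disk FAT32 format.

```python
# Toy model of a file allocation table: fat[i] holds the number of the
# cluster that follows cluster i in the same file, or END_OF_CHAIN.
END_OF_CHAIN = -1  # hypothetical marker; real FAT32 uses reserved values


def cluster_chain(fat, first_cluster):
    """Return the ordered list of clusters that make up one file."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("corrupt table: cluster chain loops")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def wasted_bytes(file_size, cluster_size):
    """Bytes left unused in the last cluster of a file."""
    remainder = file_size % cluster_size
    return 0 if remainder == 0 else cluster_size - remainder


# A directory entry records only the first cluster of each file.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN,   # contiguous file
       5: 9, 9: 7, 7: END_OF_CHAIN}   # fragmented file
directory = {"a.txt": 2, "b.txt": 5}

print(cluster_chain(fat, directory["a.txt"]))  # prints [2, 3, 4]
print(cluster_chain(fat, directory["b.txt"]))  # prints [5, 9, 7]
print(wasted_bytes(1000, 4096))                # prints 3096
```

The second file shows why fragmentation hurts: its clusters are scattered, so reading it in order means jumping around the device instead of reading sequentially.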

==== Ext2 ====
The ext2 file system (second extended file system) was designed after UFS (Unix File System) and attempts to mimic certain functionality of UFS while removing unnecessary parts. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). There is a superblock, which is a block that contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. In FAT32, the file allocation table was used to record how file fragments were organized, and it was important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed by different file systems, one can notice that the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 and ext2 file systems each manage just a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that NTFS allows files to be compressed to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, which are descriptors representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case writing to the disk is interrupted. In order to ensure data integrity, ext4 uses checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. With development starting just three years after ZFS, the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed with efficiency, integrity and maintainability in mind. It uses space efficiently through tight packing of small files and on-the-fly compression, and it offers very fast search capability. To improve integrity, Btrfs has checksums for writes and supports snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item’s data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the children of the node. Leaves contain multiple items and their data.
At a minimum, a Btrfs file system contains three trees: a tree of tree roots, which holds the roots of all the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains the information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never overwrite existing blocks in place. Instead, a transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data and writes the new data there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made in another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof the file system.

====WinFS====
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS; however, Btrfs currently lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs] ZFS has had three more years of development than Btrfs, so Btrfs may very well catch up.

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's system administrators were having problems with their UFS (Unix File System) installation. Given the ever-growing need for additional and larger disks, the administrators kept having to switch disks, create partitions, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users.

The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D. [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

* Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3

* Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

* Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*Microsoft- TechNet.(March 28, 2003) "How NTFS Works"
[http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx]

*Microsoft-TechNet. "File Systems"
[http://technet.microsoft.com/en-us/library/cc938919.aspx]

*Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows"
[http://support.microsoft.com/kb/251186]

* MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The
new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Uncredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:31:56Z

Naseido: /* Current File Systems */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it cannot be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
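
The malloc()/free() analogy can be illustrated with a small sketch. This is not the SPA interface: the class <code>StoragePool</code>, its methods and the integer "virtual addresses" are hypothetical names chosen for illustration, and the sketch only assumes that callers receive an opaque address while the pool decides which device backs it.

```python
class StoragePool:
    """Toy pooled allocator: callers see only opaque virtual addresses."""

    def __init__(self):
        self.free_blocks = []   # (device name, block number) pairs not in use
        self.allocated = {}     # virtual address -> (device name, block number)
        self.next_address = 0

    def add_device(self, name, num_blocks):
        # Growing the pool only extends the free list;
        # addresses that were already handed out are unaffected.
        self.free_blocks.extend((name, n) for n in range(num_blocks))

    def alloc(self):
        # Analogous to malloc(): hand out any free block and hide where it lives.
        if not self.free_blocks:
            raise RuntimeError("storage pool exhausted")
        location = self.free_blocks.pop(0)
        address = self.next_address
        self.next_address += 1
        self.allocated[address] = location
        return address

    def free(self, address):
        # Analogous to free(): return the block to the pool.
        self.free_blocks.append(self.allocated.pop(address))
```

Because the caller never learns which device holds a block, a device can be added to an exhausted pool and the next call to <code>alloc()</code> simply succeeds.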

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This is a very large amount of information, which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is thus used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the contents of the old block are copied to the new block, where the change is applied. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
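
The interplay of checksums and ditto blocks described above can be sketched as follows. This is a minimal illustration rather than ZFS code: the "disk" is a plain dictionary, SHA-256 stands in for whichever checksum algorithm is configured, and <code>BlockPointer</code>, <code>write_block</code> and <code>read_block</code> are hypothetical names.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is used here purely for illustration.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: one checksum plus the addresses of all copies."""

    def __init__(self, addresses, digest):
        self.addresses = list(addresses)
        self.digest = digest  # kept in the pointer, apart from the data it protects


def write_block(disk: dict, addresses, data: bytes) -> BlockPointer:
    # Store the same data at every address (the duplicate copies).
    for address in addresses:
        disk[address] = data
    return BlockPointer(addresses, checksum(data))


def read_block(disk: dict, pointer: BlockPointer) -> bytes:
    damaged = []
    for address in pointer.addresses:
        data = disk.get(address)
        if data is not None and checksum(data) == pointer.digest:
            # A healthy copy was found: rewrite the damaged copies seen so far.
            for bad_address in damaged:
                disk[bad_address] = data
            return data
        damaged.append(address)
    raise OSError("every copy failed checksum verification")
```

A read that encounters a corrupted first copy falls through to the second copy, returns the correct data and repairs the first copy along the way.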

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
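
The copy-on-write update model and the snapshots it enables can be sketched with a small in-memory tree. This is an illustration only: <code>Node</code>, <code>cow_update</code> and <code>lookup</code> are hypothetical names, and the final atomic write is assumed to correspond to the caller replacing its reference to the root.

```python
class Node:
    """A tree node that is never modified after it has been created."""

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_update(root, path, data):
    """Return a new root with `data` stored at `path`; existing nodes are untouched."""
    if not path:
        return Node(data=data)
    old_child = root.children.get(path[0]) if root else None
    # Copy only the nodes along the path; every other subtree is shared.
    new_children = dict(root.children) if root else {}
    new_children[path[0]] = cow_update(old_child, path[1:], data)
    return Node(children=new_children, data=root.data if root else None)


def lookup(root, path):
    node = root
    for name in path:
        node = node.children[name]
    return node.data
```

Keeping a reference to an old root is all that is needed for a snapshot in this sketch: the old root still reaches the old data, while the new root reaches the updated data and shares every unmodified subtree with the old one.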

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
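
A block-level deduplication scheme built on a hash table can be sketched as below. The class <code>DedupStore</code> and its methods are hypothetical, SHA-256 is an arbitrary choice of hash, and the fold factor is computed here as the ratio of logical to physical bytes, which is one assumed reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store backed by a hash table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block contents, stored only once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data: bytes):
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # A block that has been seen before is not stored a second time.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name) -> bytes:
        # Reconstruction: follow the logical links back to the stored blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(block) for block in self.blocks.values())
        return logical / physical
```

Storing two 8-byte files that share their first 4-byte block keeps three physical blocks instead of four. A smaller <code>block_size</code> tends to find more shared blocks, at the cost of more entries in <code>blocks</code> and more hashing work.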

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
When files are stored on a storage device, the device's storage space is made up of sectors (usually 512 bytes each). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must have recorded which sectors contain the data of the requested file. Since a sector is small compared to many files, it would take significant amounts of time and memory to record, for each sector, the file it is associated with and where it is located. Because of this inconvenience, the FAT file system uses clusters, which are fixed-size groupings of sectors, and each cluster belongs to one file. One issue with clusters is that when a file is smaller than a cluster, the file still occupies the whole cluster and no other file can use the unused sectors in that cluster.

In FAT32, the name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in that file, and this continues until the last cluster entry for that file. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.

The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. Larger clusters raise an issue: when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that together make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered throughout the storage device, which makes those files slower to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a built-in function of FAT32, looking for empty space to store a file requires a linear search through all the clusters.

Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
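
The way a file's clusters are found by following the table, and the effect of the 28-bit cluster number on volume size, can be sketched as follows. The table is modelled as a dictionary, and <code>END_OF_CHAIN</code> is a hypothetical marker rather than the reserved value used on disk.

```python
END_OF_CHAIN = -1  # hypothetical marker; on-disk FAT32 uses reserved entry values


def cluster_chain(fat, first_cluster):
    """Follow the linked list in the table starting from a file's first cluster."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("corrupt table: cluster chain loops")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # each entry names the next cluster of the file
    return chain


CLUSTER_NUMBER_BITS = 28  # of the 32 bits in each entry, 28 number the clusters


def max_volume_bytes(cluster_size_bytes):
    # Upper bound implied by the cluster count alone.
    return 2**CLUSTER_NUMBER_BITS * cluster_size_bytes
```

For a fragmented file whose directory entry names cluster 2 and whose table entries are 2 → 3, 3 → 7 and 7 → end, the chain is clusters 2, 3 and 7. With 32KB clusters, 2^28 clusters give the 8TB figure quoted in the comparison below, and 64KB clusters give 16TB.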

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System); it attempts to mimic certain functionality of UFS while removing unnecessary features. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of this FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when observing how storage devices are managed using different file systems, one can notice that the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 and ext2 file systems each manage just a single storage device, whereas ZFS incorporates volume management that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Like NTFS, ext4 uses volumes, but it does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents describe where the data stored in the volume is located, and they give better performance than ext3 when handling large files, for which ext3 had a very large overhead. Ext4 is also a journaling file system: it records the changes to be made in a journal before making them, in case writing to the disk is interrupted. To ensure data integrity, ext4 uses checksumming. A checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
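
The idea of an extent, and the arithmetic behind the volume limits, can be sketched as follows. The field names of <code>Extent</code> and the function <code>physical_block</code> are hypothetical and do not reflect the on-disk layout; the size limits assume a 4KB block size.

```python
from typing import NamedTuple, Optional, Sequence


class Extent(NamedTuple):
    logical_start: int   # first block of the file covered by this extent
    length: int          # number of contiguous blocks
    physical_start: int  # where that run begins on disk


def physical_block(extents: Sequence[Extent], logical_block: int) -> Optional[int]:
    """Map a block of the file to a block on disk, or None if it is not mapped."""
    for extent in extents:
        offset = logical_block - extent.logical_start
        if 0 <= offset < extent.length:
            return extent.physical_start + offset
    return None


BLOCK_SIZE = 4096  # assumption: 4KB blocks
EXT3_MAX_VOLUME = 2**32 * BLOCK_SIZE  # 32-bit block numbers
EXT4_MAX_VOLUME = 2**48 * BLOCK_SIZE  # 48-bit block numbers
```

A single extent describes a run of a thousand contiguous blocks with one descriptor, where a per-block mapping would need a thousand entries; this is the source of the large-file advantage. Under the stated block size, the two constants work out to 16TB and 1EB respectively.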

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Having started development just three years after ZFS, Btrfs shares the main ZFS goals of efficiency, integrity and maintainability, which are clearly visible in its feature set.

Btrfs was designed with efficiency, integrity and maintainability in mind. It uses space efficiently through tight packing of small files and on-the-fly compression, and it offers very fast search capability. To improve integrity, Btrfs has checksums for writes and supports snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The data stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the children of the node. Leaves contain multiple items and their data.
At a minimum, a Btrfs file system contains three trees: a tree of tree roots, which contains the roots of all the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is a pivotal piece of a number of Btrfs features. Writes never occur on the same blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof this file system.

====WinFS====
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS. Currently, however, Btrfs lacks a couple of features that ZFS has: it does not have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs. [http://en.wikipedia.org/wiki/Btrfs]

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's system administrators were having problems with their UFS (Unix File System) installation. Given the ever-growing need for additional and larger disks, the administrators kept having to switch disks, partition them, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users.

The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D. [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

*2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unattributed, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:27:31Z

Naseido: /* Legacy File Systems */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required of a file system as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. ZFS was also designed to avoid some of the major problems associated with traditional file systems: possible data corruption, especially silent corruption; the inability to expand and shrink storage dynamically; the inability to fix bad blocks automatically; a low level of abstraction; and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by ZFS, as the requirements that it satisfies are not fully implemented by any other single system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems, even when it is not used.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage: since a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
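
The malloc()/free() analogy can be made concrete with a small sketch. This is only an illustration of the pooling idea, not the SPA's real interface: the names <code>StoragePool</code>, <code>DVA</code>, <code>add_device</code>, <code>alloc</code> and <code>release</code> are hypothetical, and a DVA is modelled here as a simple (device, offset) pair.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DVA:
    """Hypothetical data virtual address: (device id, block offset)."""
    vdev: int
    offset: int


@dataclass
class StoragePool:
    """Toy pool: hands out blocks from whichever device has free space."""
    free: dict = field(default_factory=dict)   # device id -> free offsets

    def add_device(self, vdev: int, nblocks: int) -> None:
        # New storage joins the pool at run time; callers are unaffected.
        self.free[vdev] = list(range(nblocks))

    def alloc(self) -> DVA:
        # Like malloc(): the caller never chooses the device.
        for vdev, blocks in self.free.items():
            if blocks:
                return DVA(vdev, blocks.pop(0))
        raise MemoryError("pool exhausted")

    def release(self, dva: DVA) -> None:
        # Like free(): the block returns to the shared pool.
        self.free[dva.vdev].append(dva.offset)


pool = StoragePool()
pool.add_device(0, 2)
a, b = pool.alloc(), pool.alloc()   # device 0 is now full
pool.add_device(1, 2)               # grow the pool without repartitioning
c = pool.alloc()                    # lands on the new device
pool.release(a)                     # a's block is available again
```

The caller only ever sees opaque addresses, which is why devices can be added behind the allocator without any file system being resized or repartitioned.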

Virtual devices abstract device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices, each of which implements a specific task. For example, if the SPA needed to handle mirroring, a virtual device would be written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.
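
As a rough sketch of the virtual device tree, the example below builds a mirror node over two leaf devices. The class names <code>Disk</code> and <code>Mirror</code> and their <code>read</code>/<code>write</code> methods are assumptions made for illustration; they are not taken from the ZFS source.

```python
class Disk:
    """Leaf virtual device: stands in for a device driver."""
    def __init__(self):
        self.blocks = {}

    def write(self, offset, data):
        self.blocks[offset] = data

    def read(self, offset):
        return self.blocks.get(offset)


class Mirror:
    """Interior virtual device: replicates writes to every child."""
    def __init__(self, *children):
        self.children = children

    def write(self, offset, data):
        for child in self.children:
            child.write(offset, data)

    def read(self, offset):
        # Return the first copy that is present on any child.
        for child in self.children:
            data = child.read(offset)
            if data is not None:
                return data
        return None


d0, d1 = Disk(), Disk()
root = Mirror(d0, d1)
root.write(7, b"hello")
del d0.blocks[7]          # simulate losing one side of the mirror
```

Because <code>Mirror</code> exposes the same two methods as <code>Disk</code>, a mirror can itself be the child of another node, which is the property the tree description relies on.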

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This is a very large amount of information, which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks that belong to each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence, then, a ZFS file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured, since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum, which is maintained in the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
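
The read path described above (verify the checksum, fall back to another copy, repair the bad one) can be sketched as follows. This is a minimal illustration under stated assumptions: the <code>BlockPointer</code> class, the dictionary standing in for a disk, and the choice of SHA-256 are all made up for the example and are not the ZFS implementation.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is chosen here for convenience only.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: several addresses plus one checksum.

    The checksum lives here, in the parent, not next to the data.
    """
    def __init__(self, disk: dict, addresses: list, data: bytes):
        self.addresses = addresses
        self.checksum = checksum(data)
        for addr in addresses:            # write every ditto copy
            disk[addr] = data

    def read(self, disk: dict) -> bytes:
        good = None
        for addr in self.addresses:
            if checksum(disk[addr]) == self.checksum:
                good = disk[addr]
                break
        if good is None:
            raise IOError("all copies failed their checksum")
        for addr in self.addresses:       # self-heal the bad copies
            if disk[addr] != good:
                disk[addr] = good
        return good


disk = {}
bp = BlockPointer(disk, [10, 500], b"metadata")
disk[10] = b"metadatA"                    # silent corruption of one copy
recovered = bp.read(disk)                 # falls back to the copy at 500
```

After the read, the copy at address 10 has been rewritten from the healthy copy, so the corruption is repaired as a side effect of normal access rather than by an offline scan.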

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
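
A toy model of this update scheme is sketched below. It assumes a binary tree with immutable nodes and a path written as a string of 'L'/'R' steps; the names <code>Node</code>, <code>cow_update</code> and the variable <code>uberblock</code> are illustrative only.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; existing nodes are never modified in place."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    data: Optional[bytes] = None


def cow_update(node: Node, path: str, data: bytes) -> Node:
    """Return a new root with the leaf at `path` ('L'/'R' steps) replaced."""
    if not path:
        return Node(data=data)            # new leaf in a new block
    if path[0] == "L":
        return node._replace(left=cow_update(node.left, path[1:], data))
    return node._replace(right=cow_update(node.right, path[1:], data))


def read(node: Node, path: str) -> bytes:
    for step in path:
        node = node.left if step == "L" else node.right
    return node.data


old_root = Node(left=Node(data=b"a"), right=Node(data=b"b"))
new_root = cow_update(old_root, "L", b"A")   # built in a detached state
uberblock = new_root                         # the single atomic switch
```

Until the final assignment, every reader still sees the old tree in full, so a crash at any earlier point leaves a consistent file system. Keeping a reference to <code>old_root</code> afterwards is, in effect, a snapshot: unchanged subtrees are shared between the two roots rather than copied.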

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
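
The hash-table approach can be illustrated with a minimal block-level sketch. The <code>DedupStore</code> class is hypothetical, SHA-256 is an arbitrary choice of hash, and the fold factor is computed here as the ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store keyed by SHA-256."""
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.blocks = {}      # hash -> block bytes (each stored once)
        self.files = {}       # file name -> list of block hashes
        self.logical = 0      # total bytes written by callers

    def write(self, name: str, data: bytes) -> None:
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha256(block).digest()
            self.blocks.setdefault(key, block)   # store only if new
            hashes.append(key)
        self.files[name] = hashes
        self.logical += len(data)

    def read(self, name: str) -> bytes:
        # Reconstruction: follow the logical links back to the blocks.
        return b"".join(self.blocks[h] for h in self.files[name])

    def fold_factor(self) -> float:
        physical = sum(len(b) for b in self.blocks.values())
        return self.logical / physical


store = DedupStore(block_size=4)
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")   # shares two of its three blocks
```

With 4-byte blocks the two files share two blocks, so 24 logical bytes occupy 16 physical bytes. With a 12-byte block size the two files would share nothing, which is the granularity trade-off in miniature: finer blocks find more duplicates but need more hash table entries.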

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and smaller storage needs, than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes each). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since each sector is small compared with the larger files that exist, it would take significant amounts of time and memory to record, for every sector, the file it belongs to and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are fixed-size groups of sectors; each cluster belongs to at most one file. One drawback of clusters is that when a file is smaller than a cluster, the file still occupies the whole cluster, and no other file can use the unused sectors in it.

The name FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The table entry for that first cluster contains the number of the second cluster in the file, and so on until the last cluster entry for the file. The first file on a new device will use sequential clusters: the first cluster will point to the second, which will point to the third, and so on.

The number beside the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. The problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered across the storage device, and accessing such a file takes longer. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a default function in FAT32, looking for empty space to store a file requires a linear search through all the clusters.

Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and root directory, and the file system always keeps 2 copies of the file allocation table, so that if the file system is interrupted, a secondary FAT is available to recover the files.
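
The cluster chain and the 28-bit limit can both be checked with a short sketch. The table below is a plain Python list standing in for the FAT, and <code>END_OF_CHAIN</code> is an illustrative marker rather than the reserved value used on a real FAT32 volume; the function names are likewise made up for the example.

```python
END_OF_CHAIN = -1   # illustrative marker; real FAT32 uses reserved values


def cluster_chain(fat: list, first_cluster: int) -> list:
    """Follow the linked list of clusters that make up one file."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]      # the table entry names the next cluster
    return chain


def max_volume_bytes(cluster_bytes: int, index_bits: int = 28) -> int:
    """Upper bound implied by the number of addressable clusters."""
    return (2 ** index_bits) * cluster_bytes


# A fragmented file: the directory entry says it starts at cluster 2.
fat = [END_OF_CHAIN] * 8
fat[2], fat[5], fat[3] = 5, 3, END_OF_CHAIN
chain = cluster_chain(fat, 2)       # visits clusters 2, then 5, then 3

KIB, TIB = 2 ** 10, 2 ** 40
limit_32k = max_volume_bytes(32 * KIB)   # 2^28 clusters of 32 KiB
limit_64k = max_volume_bytes(64 * KIB)   # 2^28 clusters of 64 KiB
```

Reading the file means one table lookup per cluster, and the clusters can be anywhere on the device, which is why fragmentation slows access. The arithmetic gives 8 TiB for 32 KiB clusters and 16 TiB for 64 KiB clusters.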

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defined how file fragments were organized, and it was important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of "fsck" (the file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when comparing how these file systems manage storage devices, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS allows 2^58 ZB, an amount which is incomparably larger. Also, ZFS is able to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies saves ZFS time and resources, since it does not have to go systematically through a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS incorporates volume management and can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.

== '''Current File Systems''' ==

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. It was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is also part. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
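
The journaling idea (record the change first, apply it second) can be shown with a generic write-ahead sketch. This is not NTFS's log format or API; the <code>JournaledStore</code> class and its <code>crash_before_apply</code> flag are hypothetical and exist only to simulate an interruption.

```python
class JournaledStore:
    """Toy write-ahead journal: log the intent, then apply it."""
    def __init__(self):
        self.journal = []     # pending (key, value) records
        self.data = {}        # stands in for the on-disk state

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))     # 1. record the change
        if crash_before_apply:
            return                            # simulated interruption
        self.data[key] = value                # 2. apply the change
        self.journal.remove((key, value))     # 3. retire the record

    def recover(self):
        # After a crash, replay whatever the journal still holds.
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()


fs = JournaledStore()
fs.write("a.txt", "v1")
fs.write("b.txt", "v1", crash_before_apply=True)
missing_before = "b.txt" not in fs.data   # the write never reached "disk"
fs.recover()                              # replaying the journal restores it
```

Recovery only has to look at the journal, not at the whole volume, which is the reason journaling shortens consistency checks after a crash.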

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Like NTFS, ext4 uses volumes, but it does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents describe where the data stored in the volume is located. They allow for better performance than ext3 when handling large files, for which ext3 had a very large overhead. Ext4 is also a journaling file system: it records changes in a journal before making them, so that the file system can recover if a write to disk is interrupted. To help ensure data integrity, ext4 uses checksumming. A checksum has been implemented for the journal because of the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing data when it is moved. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3, which increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
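
Two of the claims above lend themselves to a quick check: why extents are cheap for large files, and where the 16TB and 1EB limits come from. The sketch below assumes 4 KiB blocks, and the <code>Extent</code> type and <code>lookup</code> function are illustrative rather than the ext4 on-disk structures.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """Hypothetical extent: a run of contiguous physical blocks."""
    logical_start: int
    physical_start: int
    length: int


def lookup(extents: list, logical_block: int) -> int:
    """Map a file's logical block number to a physical block number."""
    for ext in extents:
        if ext.logical_start <= logical_block < ext.logical_start + ext.length:
            return ext.physical_start + (logical_block - ext.logical_start)
    raise KeyError(logical_block)


# A 1,000,000-block file stored in two contiguous runs needs only two
# descriptors, where a per-block map would need one entry per block.
extents = [Extent(0, 5_000, 600_000), Extent(600_000, 900_000, 400_000)]

BLOCK = 4 * 2 ** 10                     # assumes 4 KiB blocks
ext3_limit = (2 ** 32) * BLOCK          # 32-bit block numbers: 2^44 bytes
ext4_limit = (2 ** 48) * BLOCK          # 48-bit block numbers: 2^60 bytes
```

Under the 4 KiB assumption, 2^32 blocks come to 16 TiB and 2^48 blocks come to 1 EiB, matching the figures quoted for ext3 and ext4.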

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is capacity. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it offers very similar functionality, even though much of the implementation is different. Development started just three years after ZFS, and the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed with efficiency, integrity and maintainability in mind. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes, and snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The data stored in these trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
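
The three-part key can be modelled as a tuple that sorts field by field. In this sketch the field names, the type numbers and the big-endian packing are assumptions chosen for illustration (big-endian is used only so that byte order matches sort order); none of them is claimed to be the Btrfs on-disk format.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """136-bit key: 64-bit object id, 8-bit type, 64-bit offset."""
    objectid: int
    type: int
    offset: int

    def pack(self) -> bytes:
        # Big-endian packing is an assumption made so that comparing the
        # packed bytes gives the same order as comparing the fields.
        return (self.objectid.to_bytes(8, "big")
                + self.type.to_bytes(1, "big")
                + self.offset.to_bytes(8, "big"))


# Hypothetical items belonging to two objects (ids 256 and 257).
keys = [Key(257, 2, 0), Key(256, 9, 4096), Key(256, 1, 0), Key(256, 9, 0)]
ordered = sorted(keys)       # tuples compare field by field
key_bits = len(keys[0].pack()) * 8
```

Sorting on object id first keeps all items for one object adjacent in the tree, and sorting on type and then offset within an object keeps related items, such as consecutive pieces of one file, next to each other.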

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the node's children. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of roots, which contains the roots of the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of features of Btrfs. Writes never occur in place on the same blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy, the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof the file system.

====WinFS====
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS has. Btrfs does not have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].

The laboratory's system administrators were having problems with their UFS (Unix File System). Given the ever-growing need for additional and larger disks, the administrators kept having to switch disks, create partitions, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users.

The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

*2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:27:04Z

Naseido: /* Legacy File Systems */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks of each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is in turn used to describe the file system. In essence, then, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX Layer (ZPL) provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained in the block's parent indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
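
The read path described above can be sketched roughly as follows. This is a minimal illustration, not ZFS code: the <code>BlockPointer</code> class, the dictionary standing in for a disk, and the choice of SHA-256 as the checksum are all assumptions made for the example.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in checksum; SHA-256 is used here purely for illustration."""
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: a stored checksum plus several copy addresses."""

    def __init__(self, addresses, expected):
        self.addresses = list(addresses)
        self.expected = expected


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    """Return the first copy whose checksum matches; repair earlier bad copies from it."""
    bad = []
    for addr in ptr.addresses:
        data = disk.get(addr, b"")
        if checksum(data) == ptr.expected:
            for bad_addr in bad:
                disk[bad_addr] = data  # overwrite corrupted copies with the healthy one
            return data
        bad.append(addr)
    raise IOError("all copies failed checksum verification")


payload = b"file system metadata"
disk = {10: payload, 20: payload}          # two ditto copies of the same block
ptr = BlockPointer([10, 20], checksum(payload))
disk[10] = b"file system metadatA"         # simulate silent corruption of the first copy
result = read_block(disk, ptr)
```

Because the checksum lives in the pointer rather than next to the data, a corrupted copy cannot vouch for itself, which is the point of the separation described earlier.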

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
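
As a rough sketch of the copy-on-write model described above, the following toy tree never modifies an existing node: an update builds new nodes along the path to the changed leaf, and the new tree only becomes visible when the root reference is swapped. The <code>Node</code> class and <code>cow_update</code> function are hypothetical names for this example and do not correspond to ZFS structures.

```python
class Node:
    """Hypothetical tree node: leaves carry data, interior nodes carry children."""

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_update(node, path, data):
    """Return a new root with `path` set to `data`; existing nodes are never modified."""
    if not path:
        return Node(data=data)
    head, rest = path[0], path[1:]
    child = node.children.get(head) if node else None
    new_children = dict(node.children) if node else {}
    new_children[head] = cow_update(child, rest, data)
    return Node(children=new_children, data=node.data if node else None)


old_root = cow_update(None, ["home", "a.txt"], b"v1")
old_root = cow_update(old_root, ["home", "b.txt"], b"other")

# The new structures are built in a detached state...
new_root = cow_update(old_root, ["home", "a.txt"], b"v2")
# ...and published with a single reference swap.
live_root = new_root
```

In this sketch the old root still describes the previous state in full, and unchanged subtrees are shared between the two versions, which is also what makes snapshots cheap in a copy-on-write design.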

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
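
A minimal sketch of block-level deduplication with a hash table is shown below. The <code>DedupStore</code> class is hypothetical, SHA-256 is an illustrative choice of hash, and the fold factor is computed here as the ratio of logical to physical bytes, which is one possible reading of the definition above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store keyed by SHA-256 digest."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # digest -> block bytes, each unique block stored once
        self.files = {}    # file name -> ordered list of digests

    def write(self, name, data: bytes):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            digests.append(digest)
        self.files[name] = digests

    def read(self, name) -> bytes:
        """Reconstruct a file by following its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        """Logical bytes referenced by files divided by physical bytes stored."""
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical


store = DedupStore(block_size=4)
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")
```

Shrinking <code>block_size</code> gives more chances to find duplicate blocks but grows the <code>blocks</code> table and the number of hashes computed, which is the trade-off described above.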

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==

In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.

One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices and storage needs than the average user today. The average user at the time would not have many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about the procedures for ensuring data integrity (repairing the file system and relocating files).

====FAT32====
Storage devices are divided into sectors (usually 512 bytes). Initially, these sectors were meant to contain a file's data directly, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since a sector is small compared to many files, it would take significant amounts of time and memory to track every sector along with the file it belongs to and where it is located. To avoid this overhead, the FAT file system uses clusters, which are fixed-size groups of sectors, and each cluster belongs to a single file. One drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in it. FAT stands for File Allocation Table, which is the table that contains an entry for each cluster in the storage device along with its properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. The device directory contains the name and size of each file and the number of the first cluster allocated to that file. The entry in the table for that first cluster contains the number of the second cluster in the file. This continues until the last cluster entry for that file. The first file on a new device will use sequential clusters; therefore, the first cluster will point to the second, which will point to the third, and so on.
The number beside each FAT name, as in FAT32, indicates the width of the table entries: the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available. One issue with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in the cluster is wasted. When a file is accessed, the file system must find all the clusters that make up the file, and this process takes a long time if the clusters are not organized. When files are deleted, their clusters are freed and become available for new data. Because of this, some files may have their clusters scattered across the storage device, which makes them slower to access. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a built-in function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and root directory, and the file system always keeps 2 copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
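
The linked-list behaviour of the file allocation table described above can be sketched as follows. The dictionary-based table, the file name, and the <code>END_OF_CHAIN</code> marker are assumptions made for the example; a real FAT32 volume uses reserved entry values on disk to mark the end of a chain.

```python
END_OF_CHAIN = -1  # hypothetical end marker; real FAT32 uses reserved entry values


def cluster_chain(fat, first_cluster):
    """Follow FAT entries from a file's first cluster to the end of its chain."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("loop detected in cluster chain")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # each entry holds the number of the next cluster
    return chain


# Directory entry: name -> (size in bytes, first cluster).
directory = {"REPORT.TXT": (9000, 2)}
# FAT: key is a cluster number, value is the next cluster in the same file.
fat = {2: 3, 3: 7, 7: END_OF_CHAIN, 4: END_OF_CHAIN}

size, first = directory["REPORT.TXT"]
chain = cluster_chain(fat, first)
```

Here the file occupies clusters 2, 3 and 7; the jump from 3 to 7 is the kind of fragmentation that makes access slower and that defragmentation removes.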

==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (Unix File System); it attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (file system checker), which traverses the inodes and directories to repair any inconsistencies.

==== Comparison ====
In general, when comparing how these file systems manage storage, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace any bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a single storage device, and the ext2 file system likewise manages a single storage device, whereas ZFS provides volume management that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.

== '''Current File Systems''' ==

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The way it implements storage is that it creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error with the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes that are made to the file system; this helps performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance because, in order to move compressed files, they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS does have certain volume and size constraints. In particular, NTFS is a 64-bit file system which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, each of which is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 utilizes checksumming. In ext4, a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses a 48-bit physical block number rather than the 32-bit block number used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
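
The jump from 16TB to 1EB follows directly from the width of the block number. The short calculation below assumes a 4 KiB block size, which is not stated above, and treats TB and EB as binary units (2^40 and 2^60 bytes).

```python
BLOCK_SIZE = 4 * 1024  # assumed 4 KiB block size; not stated in the text above

TIB = 2 ** 40
EIB = 2 ** 60


def max_volume_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest volume addressable with the given block-number width."""
    return (2 ** block_number_bits) * block_size


ext3_max = max_volume_bytes(32)  # 2^32 blocks * 2^12 bytes = 2^44 bytes
ext4_max = max_volume_bytes(48)  # 2^48 blocks * 2^12 bytes = 2^60 bytes
```

Under that assumption, 32-bit block numbers give 16 × 2^40 bytes and 48-bit block numbers give 2^60 bytes, matching the two limits quoted above.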

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is capacity. NTFS allows a maximum volume size of 256TB and ext4 allows 1EB. ZFS allows a maximum file system size of 16EB, which is 16 times more than ext4. Given the amount of storage it can address, ZFS is clearly better suited to servers than the other two. ZFS also has the ability to self-heal, which neither NTFS nor ext4 offers. This improves performance, as no downtime is needed to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007 and is often compared to ZFS because it offers very similar functionality, even though much of the implementation is different. Development started just three years after ZFS, and the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed for efficiency, integrity and maintainability. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs provides checksums for writes and supports snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a couple of others.
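
To illustrate the key layout described above, the sketch below packs the three fields into a single 136-bit integer so that ordinary integer comparison sorts by object id first, then by type, then by the type-specific field. The function names and the sample type values are made up for the example and are not taken from the Btrfs source.

```python
def pack_key(object_id: int, item_type: int, offset: int) -> int:
    """Pack (64-bit object id, 8-bit type, 64-bit type-specific field) into 136 bits."""
    if not (0 <= object_id < 2**64 and 0 <= item_type < 2**8 and 0 <= offset < 2**64):
        raise ValueError("field out of range")
    return (object_id << 72) | (item_type << 64) | offset


def unpack_key(key: int):
    """Split a packed key back into (object id, type, type-specific field)."""
    return key >> 72, (key >> 64) & 0xFF, key & (2**64 - 1)


# Sample keys with arbitrary illustrative type values.
keys = [pack_key(257, 12, 0), pack_key(256, 84, 5), pack_key(256, 1, 0)]
ordered = [unpack_key(k) for k in sorted(keys)]
largest = pack_key(2**64 - 1, 2**8 - 1, 2**64 - 1)
```

Sorting on the packed value keeps all items belonging to one object adjacent in the tree, grouped by type.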

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the node's children. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of roots, which contains the roots of the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on the same blocks. A transaction log is created and writes are cached; the file system then allocates sufficient blocks for the new data, and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.

Btrfs provides a very simple underlying implementation (algorithms notwithstanding) and a number of features that help future-proof the file system.

====WinFS====
====Comparison====
On first inspection Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]

== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].

The laboratory's systems administrators were having problems with their UFS (Unix File System) setup. Given the ever-growing need for additional and larger disks, the administrators kept having to switch disks, create partitions, copy old data to the new larger disks, and so on. Naturally, that meant downtime for users.

The administrators were attracted to the storage pool concept in ZFS, as well as the fact that they could create a storage pool (zpool) in a few simple commands. Besides the aforementioned features, ZFS offers some additional flexibility.

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T07:04:42Z

Naseido: /* Introduction */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]
The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, it can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
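The malloc()/free() analogy can be made concrete with a toy pool in Python. This is only a sketch of the idea of pooled storage, not the SPA's real interface: the class <code>StoragePool</code>, its method names, and the use of a <code>(device, block)</code> tuple as a stand-in for a DVA are all assumptions for illustration.

```python
class StoragePool:
    """Toy storage pool: callers ask for a block and never choose a device."""

    def __init__(self):
        self._free_blocks = {}   # device name -> list of free block numbers
        self._allocated = set()  # addresses currently handed out

    def add_device(self, name, num_blocks):
        """Grow the pool dynamically by adding a device's blocks."""
        self._free_blocks[name] = list(range(num_blocks))

    def alloc(self):
        """Return an opaque address for one free block (like malloc)."""
        for name, blocks in self._free_blocks.items():
            if blocks:
                addr = (name, blocks.pop())
                self._allocated.add(addr)
                return addr
        raise MemoryError("storage pool exhausted")

    def free(self, addr):
        """Return a block to the pool (like free)."""
        self._allocated.remove(addr)
        name, block = addr
        self._free_blocks[name].append(block)

    def free_count(self):
        """Total number of free blocks across all devices."""
        return sum(len(blocks) for blocks in self._free_blocks.values())
```

A caller that has exhausted a pool built on one small device can simply call <code>add_device</code> again and keep allocating. No file system has to be resized or repartitioned, which is the contrast with a traditional volume manager that the text draws.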

Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. This is a significantly large amount of information which allows for a larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. Maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and/or indirect blocks are then modified, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
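The read path described in the two paragraphs above (verify the checksum, fall back to a duplicate copy, repair the bad copy) can be sketched in Python. This is a simplified model, not ZFS code: the dictionary standing in for a disk, the layout of the <code>pointer</code> record, and the choice of SHA-256 as the checksum function are assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Checksum stored in the block pointer, separate from the data itself."""
    return hashlib.sha256(data).digest()


def read_block(disk: dict, pointer: dict) -> bytes:
    """Return verified data, healing any copy that fails verification.

    `pointer` holds the addresses of all duplicate copies plus the expected
    checksum: {"addresses": [...], "checksum": b"..."}.
    """
    good = None
    for addr in pointer["addresses"]:
        data = disk.get(addr)
        if data is not None and checksum(data) == pointer["checksum"]:
            good = data
            break
    if good is None:
        raise OSError("every copy failed checksum verification")
    # Self-heal: overwrite any copy that differs from the verified one.
    for addr in pointer["addresses"]:
        if disk.get(addr) != good:
            disk[addr] = good
    return good
```

If one of two copies is silently corrupted, <code>read_block</code> still returns the original data and rewrites the damaged copy; only when every copy fails verification does the read report an error.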

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
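The copy-on-write update and the snapshot mechanism described above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the ZFS on-disk format: nested dictionaries stand in for indirect blocks, the attribute <code>root</code> stands in for the uberblock's pointer, and the names <code>cow_update</code> and <code>ToyFileSystem</code> are hypothetical.

```python
def cow_update(node: dict, path: list, value):
    """Return a new tree with `path` set to `value`; `node` is never modified.

    Only the nodes along the path are copied; all other subtrees are shared.
    """
    if not path:
        return value
    key, rest = path[0], path[1:]
    new_node = dict(node)  # shallow copy of this one node
    new_node[key] = cow_update(node.get(key, {}), rest, value)
    return new_node


class ToyFileSystem:
    def __init__(self):
        self.root = {}       # stands in for the uberblock's root pointer
        self.snapshots = {}  # snapshot name -> old root

    def write(self, path, data):
        new_root = cow_update(self.root, path, data)  # built in a detached state
        self.root = new_root                          # one atomic pointer switch

    def snapshot(self, name):
        # A snapshot just keeps the old root alive; nothing is copied.
        self.snapshots[name] = self.root
```

After <code>snapshot("monday")</code>, later writes build new nodes and leave the tree reachable from the saved root untouched, so an accidentally overwritten file can still be read through <code>snapshots["monday"]</code>. A crash before the final assignment in <code>write</code> leaves the old, consistent root in place.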

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
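A minimal block-level deduplication store built on a hash table might look like the following Python sketch. It is an illustration of the general technique rather than of any particular product: the class name <code>DedupStore</code>, the tiny default block size, and the use of SHA-256 as the block fingerprint are assumptions.

```python
import hashlib


class DedupStore:
    """Stores each unique block once; files are lists of block fingerprints."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block data (the hash table)
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data: bytes):
        refs = []
        for i in range(0, len(data), self.block_size):
            chunk = data[i:i + self.block_size]
            digest = hashlib.sha256(chunk).digest()
            self.blocks.setdefault(digest, chunk)  # stored only if new
            refs.append(digest)
        self.files[name] = refs

    def read(self, name) -> bytes:
        """Reconstruct a file by following its block references."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_size(self):
        return sum(len(self.blocks[d]) for refs in self.files.values() for d in refs)

    def physical_size(self):
        return sum(len(chunk) for chunk in self.blocks.values())
```

Writing <code>b"AAAABBBBAAAA"</code> and <code>b"BBBBCCCC"</code> with a 4-byte block size stores 20 logical bytes in 12 physical bytes, since the blocks <code>AAAA</code> and <code>BBBB</code> are each kept once. Shrinking the block size tends to find more duplicates but enlarges the <code>blocks</code> table, which is the trade-off described above.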

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on storage media such as hard disks and flash memory, and when saving files to such media there must be an abstraction that organizes how the files are stored and later retrieved. That abstraction is a file system; FAT32 is one such file system, and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today. The average worker or user would not have many files stored on their hard drive, and because the data was small and might not be accessed often, these file systems did not worry much about procedures for preserving data integrity (repairing the file system and relocating files).
====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially these sectors were meant to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, the system must have recorded which sectors contain the data of the requested file. Since each sector is small compared with typical large files, it would take significant time and memory to record, for every sector, the file it belongs to and where it is located. Because of the inconvenience of tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors; each cluster belongs to one file. One drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster, and no other file can use the unused sectors in that cluster.

In FAT32, FAT stands for File Allocation Table, which is the table that contains entries for the clusters in the storage device and their properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The number beside the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values.#2.3b Of the 32 bits, 28 are used to number the clusters in the storage device, which means that 2^28 clusters are available.

Larger clusters waste more space when files are drastically smaller than the cluster size. When a file is accessed, the file system must find all the clusters that make up the file, and this takes longer if the clusters are not laid out contiguously. When files are deleted, their clusters are freed and become available for new data; as a result, some files end up with their clusters scattered across the storage device and take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmenting reorganizes the fragments of a file (clusters) so that they reside near each other, which reduces the time it takes to access a file. Since reorganization (defragmenting) is not a built-in function of FAT32, finding empty space when storing a file requires a linear search through all the clusters; this slowness is one of the drawbacks of FAT32. The first cluster of every FAT32 file system contains information about the operating system and root directory, and always contains 2 copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
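The linked-list behaviour of the file allocation table described above can be shown with a short Python sketch that follows a file's cluster chain. The table contents in the usage note are made up for illustration, and the function name <code>cluster_chain</code> is hypothetical; the 28-bit mask and the end-of-chain range reflect the "28 of 32 bits" and "all F's" details mentioned above.

```python
CLUSTER_MASK = 0x0FFFFFFF      # only the low 28 bits of an entry are used
END_OF_CHAIN_MIN = 0x0FFFFFF8  # entries at or above this mark the last cluster


def cluster_chain(fat, first_cluster):
    """Follow the FAT from a file's first cluster and list all its clusters."""
    chain = []
    seen = set()
    cluster = first_cluster
    while True:
        if cluster in seen:
            raise ValueError("corrupt FAT: cluster chain contains a loop")
        seen.add(cluster)
        chain.append(cluster)
        entry = fat[cluster] & CLUSTER_MASK
        if entry >= END_OF_CHAIN_MIN:
            return chain
        cluster = entry
```

With a table where entry 2 holds 5, entry 5 holds 3, and entry 3 holds <code>0x0FFFFFFF</code>, <code>cluster_chain(fat, 2)</code> returns <code>[2, 5, 3]</code>: a fragmented file whose clusters are not sequential on disk, which is the situation defragmentation is meant to repair.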
==== Ext2 ====
The ext2 file system (second extended file system) was modeled on UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing unnecessary parts. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. In FAT32 the file allocation table defined how file fragments were organized, and it was vital to have duplicate copies of the FAT in case of crashes. Just as FAT32 keeps duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock, and it also contains the list of group descriptors (each block group has a group descriptor to map out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system has had an unclean shutdown and requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed by different file systems, one can notice that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 has a maximum of 32TB, and ZFS supports 2^58 ZB (zettabytes), where each ZB is 2^70 bytes (considerably larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not used in ZFS, whereas it is in the ext2 file system. Not having to check for inconsistencies offline saves ZFS time and resources, since it does not have to systematically walk a storage device. FAT32 manages a single storage device, and ext2 likewise manages a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that resources are available throughout the system, and that nothing is unavailable when accessing data through ZFS.

== '''Current File Systems''' ==

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is a change journal, which records all changes made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. The main advantages of NTFS are therefore that it provides data recovery, data integrity and protection of data in case of interruption during writing.

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to the disk. To ensure data integrity ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.
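The 16TB and 1EB limits follow directly from the width of the block number once a block size is fixed. The Python sketch below works through the arithmetic, assuming a 4 KiB block size (the essay does not state the block size, so that value and the helper name <code>max_volume_bytes</code> are assumptions) and binary units (1 TB = 2^40 bytes, 1 EB = 2^60 bytes).

```python
BLOCK_SIZE = 4096  # assumed 4 KiB blocks
TIB = 2**40
EIB = 2**60


def max_volume_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest volume addressable with block numbers of the given width."""
    return (2**block_number_bits) * block_size


ext3_limit = max_volume_bytes(32)  # 2^32 blocks * 2^12 bytes = 2^44 bytes
ext4_limit = max_volume_bytes(48)  # 2^48 blocks * 2^12 bytes = 2^60 bytes
```

Under these assumptions <code>ext3_limit</code> equals 16 × 2^40 bytes and <code>ext4_limit</code> equals 2^60 bytes, matching the 16TB and 1EB figures quoted above. The 16 extra address bits multiply the addressable space by 2^16 = 65,536.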

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows for a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is often compared to ZFS because it offers very similar functionality, even though much of the implementation is different. Development began just three years after ZFS, and the main goals of efficiency, integrity and maintainability are clearly visible in the Btrfs feature set.

Btrfs was designed for efficiency, integrity and maintainability. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item's type, and the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few others.

The trees are constructed from these primitives, with interior nodes containing only keys to identify the node and block pointers that point to the node's children. Leaves contain multiple items and their data.

At a minimum, a Btrfs file system contains three trees: a tree of tree roots, which holds the roots of all the other trees; a subvolume tree, which holds files and directories; and an extent tree, which contains information about all allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Additional trees can be added to support other features such as logs, chunks and data relocation.

====WinFS====
====Comparison====
On first inspection Btrfs seems nearly identical to ZFS; currently, however, Btrfs lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]

== '''Conclusion''' ==

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

*2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T06:25:12Z

Naseido: /* NTFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is not used.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). However, instead of memory allocation, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
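
The malloc()/free() analogy can be made concrete with a toy allocator. This is only a sketch under simplifying assumptions: the class <code>StoragePool</code> and its methods are hypothetical names, and an address here is just a (device, block) pair rather than a real DVA.

```python
class StoragePool:
    """Toy storage pool: callers ask for a block and get back an opaque address.

    Callers never choose a device; the pool does. That is the point of the
    malloc()/free() comparison.
    """

    def __init__(self):
        self.free_blocks = {}  # device name -> set of free block numbers

    def add_device(self, name, num_blocks):
        """Grow the pool at any time by adding a device."""
        self.free_blocks[name] = set(range(num_blocks))

    def alloc(self):
        """Return the address of a free block from any device with room."""
        for name, blocks in self.free_blocks.items():
            if blocks:
                block = min(blocks)
                blocks.remove(block)
                return (name, block)
        raise RuntimeError("storage pool is full")

    def release(self, address):
        """Give a block back to the pool."""
        name, block = address
        self.free_blocks[name].add(block)


pool = StoragePool()
pool.add_device("disk0", 1)
first = pool.alloc()          # comes from disk0
pool.add_device("disk1", 2)   # pool grows without the caller noticing
second = pool.alloc()         # disk0 is full, so this comes from disk1
```

Because callers only ever hold addresses handed out by the pool, adding a device changes nothing on the caller's side.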

Virtual devices (vdevs) abstract the underlying device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles traditional volume manager tasks such as mirroring, and it accomplishes them through virtual devices, each of which implements a specific task. For example, if the SPA needed to handle mirroring, a virtual device would be written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.
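
As a rough illustration of the "node with children" idea, the sketch below builds a mirroring virtual device on top of two leaf devices. The class names <code>DiskVdev</code> and <code>MirrorVdev</code> and their <code>read</code>/<code>write</code> interface are assumptions for this example and do not reflect the real ZFS vdev interface.

```python
class DiskVdev:
    """Leaf node: stands in for a device driver, backed by a dict."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks[block_no]  # raises KeyError if the block is missing


class MirrorVdev:
    """Interior node: same interface as a leaf, but forwards to its children."""

    def __init__(self, children):
        self.children = children

    def write(self, block_no, data):
        # Every child receives a full copy of the data.
        for child in self.children:
            child.write(block_no, data)

    def read(self, block_no):
        # Return the first copy that can be read.
        for child in self.children:
            try:
                return child.read(block_no)
            except KeyError:
                continue
        raise KeyError(block_no)


disk_a, disk_b = DiskVdev(), DiskVdev()
mirror = MirrorVdev([disk_a, disk_b])
mirror.write(7, b"hello")
del disk_a.blocks[7]           # simulate losing the copy on one disk
recovered = mirror.read(7)     # still served from the other child
```

Since a mirror exposes the same interface as a disk, mirrors can themselves be children of other virtual devices.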

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. This is a very large amount of information (2^64 bytes is 16 EiB), which allows for a much larger storage limit.

ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates [Z1.P9]. Consistency is therefore assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained by the block's parent indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption. If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is then able to retrieve a backup from another location and correct (heal) the related block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated and the modified contents are written there, leaving the old block untouched. Any pointers and/or indirect blocks are then updated in the same way, traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
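
The read path described above (verify the checksum, fall back to a duplicate copy, repair the bad one) might look roughly like the following sketch. The disk is modelled as a plain dict, SHA-256 is chosen only for convenience, and <code>BlockPointer</code> and <code>read_with_healing</code> are hypothetical names rather than ZFS structures.

```python
import hashlib
from typing import List, NamedTuple


class BlockPointer(NamedTuple):
    addresses: List[int]  # locations of the identical copies (ditto blocks)
    checksum: str         # stored with the pointer, not with the data


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_with_healing(disk: dict, ptr: BlockPointer) -> bytes:
    """Return the first copy whose checksum matches, repairing bad copies."""
    good = None
    for addr in ptr.addresses:
        if checksum(disk[addr]) == ptr.checksum:
            good = disk[addr]
            break
    if good is None:
        raise IOError("all copies failed checksum verification")
    for addr in ptr.addresses:
        if disk[addr] != good:
            disk[addr] = good  # overwrite the corrupted copy with healthy data
    return good


disk = {10: b"metadata", 20: b"metadata"}
ptr = BlockPointer(addresses=[10, 20], checksum=checksum(b"metadata"))
disk[10] = b"metadXta"                    # simulate silent corruption
data = read_with_healing(disk, ptr)
```

Keeping the checksum in the pointer rather than next to the data is what lets the reader tell which copy is the damaged one.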

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
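
A minimal sketch of the copy-on-write idea follows, using nested dicts as a stand-in for the on-disk tree. The function <code>cow_update</code> and the variable <code>uberblock</code> are illustrative names only; real ZFS works on disk blocks and transaction groups, which this example does not model.

```python
def cow_update(node, path, new_data):
    """Return a new tree in which the leaf at `path` holds `new_data`.

    Only the nodes along the path are copied; every other subtree is
    shared with the old tree, which is left completely untouched.
    """
    if not path:
        return new_data
    new_node = dict(node)  # shallow copy of this level only
    new_node[path[0]] = cow_update(node[path[0]], path[1:], new_data)
    return new_node


old_root = {"a": {"x": b"old", "y": b"keep"}, "b": {"z": b"other"}}

# Build the replacement structures in a detached state ...
new_root = cow_update(old_root, ["a", "x"], b"new")

# ... then publish them with a single pointer switch. Until this line
# runs, a reader following `uberblock` still sees a consistent old tree.
uberblock = old_root
uberblock = new_root
```

If a crash happened before the final assignment, the old tree would still be complete, which is why no journal replay is needed in this model.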

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
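
To make the hash-table approach concrete, here is a small block-level sketch. The function names, the use of SHA-256, and the fixed block size are assumptions for illustration, and the fold factor is taken here as logical bytes divided by physical bytes.

```python
import hashlib


def dedup_store(files, block_size):
    """Split each file into fixed-size blocks and store each unique block once."""
    store = {}    # block hash -> block contents (the physical storage)
    recipes = {}  # file name -> ordered list of block hashes (the logical view)
    for name, data in files.items():
        hashes = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # keep only the first copy
            hashes.append(digest)
        recipes[name] = hashes
    return store, recipes


def reconstruct(store, recipe):
    """Rebuild a file by concatenating its blocks in order."""
    return b"".join(store[digest] for digest in recipe)


files = {"a.txt": b"AAAABBBBCCCC", "b.txt": b"AAAABBBBDDDD"}
store, recipes = dedup_store(files, block_size=4)

logical_bytes = sum(len(data) for data in files.values())
physical_bytes = sum(len(block) for block in store.values())
fold_factor = logical_bytes / physical_bytes
```

With 4-byte blocks the two files share two blocks, so 24 logical bytes need only 16 physical bytes; with a 12-byte block size nothing would be shared, which illustrates the granularity trade-off.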

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on storage media such as hard disks and flash memory, and saving files onto such media requires an abstraction that organizes how the files are stored and later retrieved. That abstraction is a file system; FAT32 is one such file system and ext2 is another. These file systems were designed for users who had fewer and smaller storage devices than today. The average worker or user would not have many files stored on their hard drive, and because the amounts of data were small and might not be accessed very often, these file systems did not concern themselves much with procedures for repairing data integrity (repairing the file system and relocating files).
====FAT32====
A storage device's space is made up of sectors (usually 512 bytes each). Initially, sectors were meant to hold the data of a file directly, with larger files stored across multiple sectors. To retrieve a file, the file system must record which sectors contain that file's data. Since a sector is small compared with large files, it would take significant time and memory to track every sector together with the file it belongs to and where it is located. To avoid tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors; each cluster belongs to at most one file. One drawback of clusters is that when a file is smaller than a cluster, the file still occupies the whole cluster and no other file can use the unused sectors in it.

The name FAT stands for File Allocation Table, which is the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The number beside the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available.

Larger clusters waste more space when files are much smaller than the cluster size. When a file is accessed, the file system must find all the clusters that make up the file, which takes a long time if the clusters are not stored close together. When files are deleted, their clusters become available for new data; as a result, a file's clusters may end up scattered across the storage device, and accessing the file takes longer. FAT32 does not include a defragmentation system, but all recent versions of Windows come with a defragmentation tool. Defragmenting reorganizes the fragments (clusters) of a file so that they reside near each other, which reduces the time it takes to access the file. Since reorganization (defragmenting) is not a built-in function of FAT32, finding empty space for a new file requires a linear search through all the clusters; this slowness is one of the drawbacks of FAT32. The first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table so that if the file system is interrupted, a secondary FAT is available to recover the files.
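
The linked-list behaviour of the table can be sketched in a few lines. The FAT is modelled here as a dict from cluster number to next-cluster entry, and the helper name <code>cluster_chain</code> is hypothetical; the 28-bit mask follows the description above, and entries from 0x0FFFFFF8 upward are treated as end-of-chain markers (the "all F's" case).

```python
END_OF_CHAIN_MIN = 0x0FFFFFF8  # entries at or above this value end a chain
CLUSTER_MASK = 0x0FFFFFFF      # only the low 28 bits number a cluster


def cluster_chain(fat, first_cluster):
    """Follow FAT entries from a file's first cluster to its last."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster < END_OF_CHAIN_MIN:
        if cluster in seen:
            raise ValueError("corrupt FAT: cluster chain contains a loop")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster] & CLUSTER_MASK
    return chain


# A fragmented file starting at cluster 2, and a one-cluster file at 4.
fat = {2: 3, 3: 7, 7: 0x0FFFFFFF, 4: 0x0FFFFFFF}
file_clusters = cluster_chain(fat, 2)
```

The directory entry supplies only the first cluster; everything else about the file's layout comes from walking the table, which is why scattered clusters make access slower.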
==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System); it attempts to mimic certain functionality of UFS while removing what was unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32 the file allocation table defines how file fragments are organized, and it is vital to have duplicate copies of the FAT in case of crashes. Similarly, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the system in case the primary copy is damaged. These backup copies are used when the system has had an unclean shutdown and requires “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#2.3d
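
As a small worked example of how the per-group counts in the superblock are used, the sketch below finds which block group holds a given inode. The function name <code>locate_inode</code> and the sample value of 8192 inodes per group are assumptions for illustration; inode numbers are taken to start at 1.

```python
def locate_inode(inode_number, inodes_per_group):
    """Return (block group, index within that group's inode table)."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    index = inode_number - 1
    return index // inodes_per_group, index % inodes_per_group


# With an assumed 8192 inodes per group:
first = locate_inode(1, 8192)        # first inode of group 0
boundary = locate_inode(8193, 8192)  # first inode of group 1
later = locate_inode(20000, 8192)
```

Because the mapping is pure arithmetic on values held in the superblock, no per-inode lookup table is needed to find the right block group.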
==== Comparison ====
Comparing how these file systems manage storage devices, FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS can address 2^58 ZB (zettabytes), where each ZB is 2^70 bytes, for a total of 2^128 bytes. “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e; because of this, fsck is not used in ZFS, whereas it is in the ext2 file system. Not having to check for inconsistencies offline allows ZFS to save time and resources by not systematically going through a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS integrates volume management and can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that their resources are available throughout the system, and that no part of the storage is out of reach when accessing data through ZFS.

== '''Current File Systems''' ==

====NTFS====
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.

The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT, which ensures that if there is an error with the primary MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes that are made to the file system; this assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
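
The "enter the change into the journal before making it" rule can be sketched with a toy key-value store. This is a generic illustration of journaling, not the NTFS log format; <code>JournaledStore</code> and the <code>crash_before_apply</code> flag are hypothetical and exist only to simulate an interruption.

```python
class JournaledStore:
    """Toy store that logs each intended write before applying it."""

    def __init__(self):
        self.data = {}     # stands in for the on-disk structures
        self.journal = []  # pending (key, value) writes not yet confirmed

    def write(self, key, value, crash_before_apply=False):
        self.journal.append((key, value))  # 1. record the intent
        if crash_before_apply:
            return                         # simulated interruption
        self.data[key] = value             # 2. apply the change
        self.journal.pop()                 # 3. mark it complete

    def recover(self):
        """After a restart, replay whatever the journal still holds."""
        for key, value in self.journal:
            self.data[key] = value
        self.journal.clear()


store = JournaledStore()
store.write("a", 1)
store.write("b", 2, crash_before_apply=True)  # change is logged but not applied
before_recovery = dict(store.data)
store.recover()
```

Recovery only has to look at the journal, not scan the whole volume, which is the benefit the paragraph above attributes to journaling.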

Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance because, in order to move compressed files, they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.

====ext4====
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes to be made in a journal and then makes the changes, in case there is an interruption while writing to the disk. In order to ensure data integrity, ext4 utilizes checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.
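
Two points from this paragraph can be illustrated briefly: how an extent maps a run of logical blocks to physical blocks, and where the 1EB and 16TB limits come from. The names <code>Extent</code>, <code>map_block</code> and <code>max_volume_bytes</code> are hypothetical, and the 4 KiB block size is an assumption used for the arithmetic.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    logical_start: int   # first block of the run, counted within the file
    physical_start: int  # first block of the run on disk
    length: int          # number of contiguous blocks


def map_block(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Translate a file-relative block number to a physical block number."""
    for extent in extents:
        if extent.logical_start <= logical_block < extent.logical_start + extent.length:
            return extent.physical_start + (logical_block - extent.logical_start)
    return None  # the block is not mapped


def max_volume_bytes(block_number_bits: int, block_size: int = 4096) -> int:
    """Largest volume addressable with the given block-number width."""
    return 2**block_number_bits * block_size


# A 12-block file described by just two extents instead of 12 block pointers.
extents = [Extent(0, 1000, 8), Extent(8, 5000, 4)]
third = map_block(extents, 3)
ninth = map_block(extents, 9)

ext4_limit = max_volume_bytes(48)  # 2**60 bytes
ext3_limit = max_volume_bytes(32)  # 2**44 bytes
```

One descriptor per contiguous run is what keeps the overhead low for large files, and widening the block number from 32 to 48 bits multiplies the addressable volume size by 2^16.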

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume of 256TB and ext4 allows for 1EB. ZFS allows a single file of up to 16EB (2^64 bytes), which on its own is 16 times the maximum ext4 volume, and its 128-bit pool addressing places the limit on total storage far beyond that. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System, was started by Oracle in 2007. It is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. With development starting just three years after ZFS, the main goals of efficiency, integrity and maintainability are clearly visible in the feature set of Btrfs.

Btrfs was designed with efficiency, integrity and maintainability in mind. It uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs provides checksums for writes as well as snapshots. Maintainability is supported by efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.

Btrfs is based on the B-tree structure. In Btrfs, not only are files and directories stored in a B-tree, but so is file system data such as extents, transaction logs, data relocation information and chunks. A subvolume is a named B-tree made up of the files and directories it stores.

The implementation of Btrfs is quite interesting; the developers created it with good coding practices. First of all, Btrfs has very simple underlying components, which makes the basic building blocks easy for most developers to understand. Additionally, the system uses the B-tree structure extensively: in addition to actual data and metadata storage, the structures that maintain and track space allocation use B-trees as well.

Upon first inspection Btrfs currently seems nearly identical to ZFS; however, Btrfs lacks a couple of features that ZFS has. Btrfs doesn't have the self-healing capability or data deduplication of ZFS. ZFS also supports more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]

====WinFS====
====Comparison====

== '''Conclusion''' ==

TO-DO: Performance V.S. efficiency. ZFS provides the ability to do away with checksumming

TO-DO: Plasma Physics Example

== '''References''' ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.

*2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

* Chris Mason, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle