Soma-notes - User contributions [en]

Talk:COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:56:50Z

Azemanci:

Alright so its due tomorrow. I was hoping to get an idea of when everyone will be posting there completed sections thanks. --[[User:Azemanci|Azemanci]] 03:56, 2 December 2010 (UTC)

'''Actual group members'''

- Nicholas Shires nshires@connect.carleton.ca

- Andrew Zemancik andy.zemancik@gmail.com

- [[user:abondio2|Austin Bondio]] -> abondio2@connect.carleton.ca

- David Krutsko :: dkrutsko at connect.carleton.ca

If everyone could just post there names and contact information.--[[User:Azemanci|Azemanci]] 02:57, 15 November 2010 (UTC)

IMPORTANT 
THINGS WE NEED TO DEFINE: 
* Happens-before reasoning
* Lock-set based reasoning
* Hardware Breakpoints 
The prof seemed to be very focused on hardware breakpoints, so it is very important to define it well, and talk about it often, it looks like hardware breakpoints are the one thing thats setting DataCollider apart from other race detectors, so lets focus on it! 
IMPORTANT 

'''Who's Doing What'''

=Research Problem=
I'll do 'Research Problem' and help out with the 'Critique' section, the professor said that part was pretty big [[User:Nshires|Nshires]] 20:45, 21 November 2010 (UTC)

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.
http://www.hpcaconf.org/hpca13/papers/014-zhou.pdf

Moved from main page: (p.s thanks for the info!)[[User:Nshires|Nshires]] 02:32, 2 December 2010 (UTC)

Just a few rough notes:
Research problem / challenges for traditional detectors:

- data-race detectors run in user mode, whereas operating systems run kernel mode (supervisor mode).

- There are a lot of different synchronization methods, and a lot of ways to implement them. So it's nearly impossible to try and code a program that can catch all of them.

- Some kernel modules can "speak privately" with hardware components, so you can't make a program that just logs all the kernel's interactions.

- traditional data race detectors incur massive time overheads because they have to keep an eye on every single memory transaction that occurs at runtime.

--[[User:Abondio2|Austin Bondio]] 01:57, 2 December 2010 (UTC)

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

Ill do Contribution: [[User:Achamney|Achamney]] 03:50, 22 November 2010 (UTC) 

Proving that DataCollider is better: 
The key part of the contribution of this essay is its competition. The research team for DataCollider looked at several other implementations of race condition testers to find ways of improving their own program, or to look for different ways of solving the same problem. 

Some of the programs that were referenced were: 

* Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
* RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking 
* PACER: Proportional Detection of Data Races 
* LiteRace: Effective Sampling for Lightweight Data-Race Detection 
* FastTrack: Efficient and Precise Dynamic Race Detection 
* MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
* RacerX: Effective, Static Detection of Race Conditions and Deadlocks 

Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
lock-set based reasoning 

Eraser, a data race detector programmed in 1997, was one of the earlier data race detectors on the market. It may have been a useful and revolutionary program of its time, however, it uses very low level techniques compared to most data race detectors today. One of the reason why it is unsuccessful is because it only checks whether memory accesses use proper locking techniques. If a memory access is found that does not use a lock, then Eraser will report a data race. In many cases, the misuse of proper locking techniques is a conscious decision by the programmer, so Eraser will report many false positives. This also does not take into account all of the benign problems such as date of access variables. DataCollider used this source as an example of a lock-set based program, and why they are a poor choice for a race condition debugger. 

PACER: Proportional Detection of Data Races 
happens-before reasoning 
Pacer, a happens-before reasoning data race detector, uses the FastTrack algorithm to detect data races. FastTrack uses vector-clocks to keep track of two threads, and find whether or not they are conflicting in any way. Pacer samples some percentage of each memory access, (from 1 to 3 percent) and runs the FastTrack happens-before algorithm on each thread that accesses that part of memory. DataCollider used this source as an example of the implementation of sampling. Similar to Pacer, DataCollider samples some memory accesses, but instead of using vector-clocks to catch the second thread, they use hardware break points. Hardware break points are considerably faster, and cause DataCollider to run much faster than Pacer. 

LiteRace: Effective Sampling for Lightweight Data-Race Detection 
happens-before reasoning 
LiteRace, similar to Pacer, samples a percentage of memory accesses from a program. Where it differs is the parts of memory that LiteRace samples the most. The "hot spot" regions of memory are ones that are accessed most by the program. Since they are accessed the most, chances are that they have already been successfully debugged, or if there are data races there, they are benign. LiteRace detects these areas in memory as hot spots, and samples them at a much lower rate. This improves LiteRace's chances of capturing a valid data race at a much lower sampling rate. Where DataCollider bests LiteRace is based on LiteRace's installing mechanism. LiteRace needs to be recompiled into the software it is trying to debug, whereas DataColleder's breakpoints do not require any code changes to the program. This is a major success for DataCollider because often third party testers do not have the source code for a program. 

RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Trackings 
combo of lock-set and happens-before reasoning 
HIGH OVERHEAD[http://www.cs.ucla.edu/~dlmarino/pubs/pldi09.pdf]

MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
combo of lock-set and happens-before reasoning 

I've noticed a couple things for controversy, even though its not my topic
The biggest thing i saw was that dataCollider reports non-erroneous operations 90% of the time. This makes the user have to sift through all of the reports to separate the problems from the benign races. [[User:Achamney|Achamney]] 17:18, 22 November 2010 (UTC) 

=Background Concepts=
Hey guys, sorry I'm late to the party. I'll get started with Background Concepts. - [[user:abondio2|Austin Bondio]] 15:33, 23 November 2010 (UTC)

=Critique=

I'll work on the critique which will probably need more then one person and I'll also fill out the paper information section.--[[User:Azemanci|Azemanci]] 18:42, 23 November 2010 (UTC)

DataCollider:
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors. The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed. This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

feel free to add/edit anything [[User:Nshires|Nshires]] 02:54, 2 December 2010 (UTC)

Right on thanks for that I was just about to start writing a section on data collider I'm not really sure what else we can critique.--[[User:Azemanci|Azemanci]] 03:11, 2 December 2010 (UTC)

I added a few things to the what you wrote and I also moved it to the main page. --[[User:Azemanci|Azemanci]] 03:22, 2 December 2010 (UTC)

COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:33:40Z

Azemanci: /* Critique */

=Paper=
'''Effective Data-Race Detection for the Kernel'''

Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf

Video: http://homeostasis.scs.carleton.ca/osdi/video/erickson.mp4

Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

A data race is a potentially catastrophic event which can be alarmingly common in modern concurrent systems. When one thread attempts to read or write on a memory location at the same time that another thread is writing on the same location, there exists a potential data race condition. If the race is not handled properly, it could have a wide range of negative consequences. In the best case, there might be data corruption rendering the affected files unreadable and useless; this may not be a major problem if there exist archived, non-corrupted versions of the data. In the worst case, a process (possibly even the operating system itself) may freak out and crash, unable to decide what to do about the unexpected input it receives.

Traditional data-race detection programs operate by running an isolated runtime and comparing it with the currently active runtime, to find situations that would have resulted in a data race if the runtimes were not isolated. DataCollider operates by temporarily setting up breakpoints at random memory access instances. If a certain memory access hits a breakpoint, DataCollider springs into action. The breakpoint causes the memory access instruction to be postponed, and so the instruction pretty much goes to sleep until DataCollider has finished its job. The job is like taking a before and after photograph of something; DataCollider records the data stored at the address the instruction was attempting to access, then allows the instruction to execute. Then DataCollider records the data again. If the before and after records do not match, then another thread has tampered with the data at the same time that this instruction was trying to read it; this is precisely the definition of a data race.

[Don't worry guys; that's not all I've got. I'm still working on it.]

--[[User:Abondio2|Austin Bondio]] 01:56, 2 December 2010 (UTC)

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=

===Style===
This paper is well put together. It has a strong flow and there is nothing that seems out of place. The authors start with an introduction and then immediately identify key definitions that are used throughout the paper. In the second section which follows the introduction the authors identify the definition of a Data-Race as it relates to their paper. This is important since it is a key concept that is required to understand the entire paper. This definition is required because as the authors state there is no standard for exactly how to define a data-race.[1] In addition to important definitions any background information that is relevant to this paper is presented at the beginning. The key idea which the paper is based on in this case Data Collider and its implementation is explained. An evaluation and conclusion of Data Collider follow its description. The order of the sections makes sense and the author is not jumping around from one concept to another. The organization of the sections and information provided make the paper easy to follow and understand.

===Content===
=====Data Collider:=====
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors.[1] The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed.[1] This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

The overhead of any application running is very important to all users. The developers of DataCollider ran various tests to determine the overhead of running DataCollider based on the number of breakpoints. These results were included in the final paper. DataCollider has a low overall base overhead and it is only after 1000 breakpoints a second does the run time overhead increase drastically.[1] This adds to the effectiveness of DataCollider. Having a low overhead is very important to use of an application.

=References=
[1] Erickson, Musuvathi, Burchhardt, Olynyk, Effective Data-Race Detection for the Kernel, Microsoft Research, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf PDF]

COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:30:28Z

Azemanci: /* Data Collider: */

=Paper=
'''Effective Data-Race Detection for the Kernel'''

Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf

Video: http://homeostasis.scs.carleton.ca/osdi/video/erickson.mp4

Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

A data race is a potentially catastrophic event which can be alarmingly common in modern concurrent systems. When one thread attempts to read or write on a memory location at the same time that another thread is writing on the same location, there exists a potential data race condition. If the race is not handled properly, it could have a wide range of negative consequences. In the best case, there might be data corruption rendering the affected files unreadable and useless; this may not be a major problem if there exist archived, non-corrupted versions of the data. In the worst case, a process (possibly even the operating system itself) may freak out and crash, unable to decide what to do about the unexpected input it receives.

Traditional data-race detection programs operate by running an isolated runtime and comparing it with the currently active runtime, to find situations that would have resulted in a data race if the runtimes were not isolated. DataCollider operates by temporarily setting up breakpoints at random memory access instances. If a certain memory access hits a breakpoint, DataCollider springs into action. The breakpoint causes the memory access instruction to be postponed, and so the instruction pretty much goes to sleep until DataCollider has finished its job. The job is like taking a before and after photograph of something; DataCollider records the data stored at the address the instruction was attempting to access, then allows the instruction to execute. Then DataCollider records the data again. If the before and after records do not match, then another thread has tampered with the data at the same time that this instruction was trying to read it; this is precisely the definition of a data race.

[Don't worry guys; that's not all I've got. I'm still working on it.]

--[[User:Abondio2|Austin Bondio]] 01:56, 2 December 2010 (UTC)

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

===Style===
This paper is well put together. It has a strong flow and there is nothing that seems out of place. The authors start with an introduction and then immediately identify key definitions that are used throughout the paper. In the second section which follows the introduction the authors identify the definition of a Data-Race as it relates to their paper. This is important since it is a key concept that is required to understand the entire paper. This definition is required because as the authors state there is no standard for exactly how to define a data-race.[1] In addition to important definitions any background information that is relevant to this paper is presented at the beginning. The key idea which the paper is based on in this case Data Collider and its implementation is explained. An evaluation and conclusion of Data Collider follow its description. The order of the sections makes sense and the author is not jumping around from one concept to another. The organization of the sections and information provided make the paper easy to follow and understand.

===Content===
=====Data Collider:=====
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors.[1] The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed.[1] This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

The overhead of any application running is very important to all users. The developers of DataCollider ran various tests to determine the overhead of running DataCollider based on the number of breakpoints. These results were included in the final paper. DataCollider has a low overall base overhead and it is only after 1000 breakpoints a second does the run time overhead increase drastically.[1] This adds to the effectiveness of DataCollider. Having a low overhead is very important to use of an application.

=References=
[1] Erickson, Musuvathi, Burchhardt, Olynyk, Effective Data-Race Detection for the Kernel, Microsoft Research, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf PDF]

Talk:COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:22:34Z

Azemanci: /* Critique */

'''Actual group members'''

- Nicholas Shires nshires@connect.carleton.ca

- Andrew Zemancik andy.zemancik@gmail.com

- [[user:abondio2|Austin Bondio]] -> abondio2@connect.carleton.ca

- David Krutsko :: dkrutsko at connect.carleton.ca

If everyone could just post there names and contact information.--[[User:Azemanci|Azemanci]] 02:57, 15 November 2010 (UTC)

IMPORTANT 
THINGS WE NEED TO DEFINE: 
* Happens-before reasoning
* Lock-set based reasoning
* Hardware Breakpoints 
The prof seemed to be very focused on hardware breakpoints, so it is very important to define it well, and talk about it often, it looks like hardware breakpoints are the one thing thats setting DataCollider apart from other race detectors, so lets focus on it! 
IMPORTANT 

'''Who's Doing What'''

=Research Problem=
I'll do 'Research Problem' and help out with the 'Critique' section, the professor said that part was pretty big [[User:Nshires|Nshires]] 20:45, 21 November 2010 (UTC)

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.
http://www.hpcaconf.org/hpca13/papers/014-zhou.pdf

Moved from main page: (p.s thanks for the info!)[[User:Nshires|Nshires]] 02:32, 2 December 2010 (UTC)

Just a few rough notes:
Research problem / challenges for traditional detectors:

- data-race detectors run in user mode, whereas operating systems run kernel mode (supervisor mode).

- There are a lot of different synchronization methods, and a lot of ways to implement them. So it's nearly impossible to try and code a program that can catch all of them.

- Some kernel modules can "speak privately" with hardware components, so you can't make a program that just logs all the kernel's interactions.

- traditional data race detectors incur massive time overheads because they have to keep an eye on every single memory transaction that occurs at runtime.

--[[User:Abondio2|Austin Bondio]] 01:57, 2 December 2010 (UTC)

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

Ill do Contribution: [[User:Achamney|Achamney]] 03:50, 22 November 2010 (UTC) 

Proving that DataCollider is better: 
The key part of the contribution of this essay is its competition. The research team for DataCollider looked at several other implementations of race condition testers to find ways of improving their own program, or to look for different ways of solving the same problem. 

Some of the programs that were referenced were: 

* Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
* RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking 
* PACER: Proportional Detection of Data Races 
* LiteRace: Effective Sampling for Lightweight Data-Race Detection 
* FastTrack: Efficient and Precise Dynamic Race Detection 
* MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
* RacerX: Effective, Static Detection of Race Conditions and Deadlocks 

Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
lock-set based reasoning 

Eraser, a data race detector programmed in 1997, was one of the earlier data race detectors on the market. It may have been a useful and revolutionary program of its time, however, it uses very low level techniques compared to most data race detectors today. One of the reason why it is unsuccessful is because it only checks whether memory accesses use proper locking techniques. If a memory access is found that does not use a lock, then Eraser will report a data race. In many cases, the misuse of proper locking techniques is a conscious decision by the programmer, so Eraser will report many false positives. This also does not take into account all of the benign problems such as date of access variables. DataCollider used this source as an example of a lock-set based program, and why they are a poor choice for a race condition debugger. 

PACER: Proportional Detection of Data Races 
happens-before reasoning 
Pacer, a happens-before reasoning data race detector, uses the FastTrack algorithm to detect data races. FastTrack uses vector-clocks to keep track of two threads, and find whether or not they are conflicting in any way. Pacer samples some percentage of each memory access, (from 1 to 3 percent) and runs the FastTrack happens-before algorithm on each thread that accesses that part of memory. DataCollider used this source as an example of the implementation of sampling. Similar to Pacer, DataCollider samples some memory accesses, but instead of using vector-clocks to catch the second thread, they use hardware break points. Hardware break points are considerably faster, and cause DataCollider to run much faster than Pacer. 

LiteRace: Effective Sampling for Lightweight Data-Race Detection 
happens-before reasoning 
LiteRace, similar to Pacer, samples a percentage of memory accesses from a program. Where it differs is the parts of memory that LiteRace samples the most. The "hot spot" regions of memory are ones that are accessed most by the program. Since they are accessed the most, chances are that they have already been successfully debugged, or if there are data races there, they are benign. LiteRace detects these areas in memory as hot spots, and samples them at a much lower rate. This improves LiteRace's chances of capturing a valid data race at a much lower sampling rate. Where DataCollider bests LiteRace is based on LiteRace's installing mechanism. LiteRace needs to be recompiled into the software it is trying to debug, whereas DataColleder's breakpoints do not require any code changes to the program. This is a major success for DataCollider because often third party testers do not have the source code for a program. 

RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Trackings 
combo of lock-set and happens-before reasoning 
HIGH OVERHEAD[http://www.cs.ucla.edu/~dlmarino/pubs/pldi09.pdf]

MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
combo of lock-set and happens-before reasoning 

I've noticed a couple things for controversy, even though its not my topic
The biggest thing i saw was that dataCollider reports non-erroneous operations 90% of the time. This makes the user have to sift through all of the reports to separate the problems from the benign races. [[User:Achamney|Achamney]] 17:18, 22 November 2010 (UTC) 

=Background Concepts=
Hey guys, sorry I'm late to the party. I'll get started with Background Concepts. - [[user:abondio2|Austin Bondio]] 15:33, 23 November 2010 (UTC)

=Critique=

I'll work on the critique which will probably need more then one person and I'll also fill out the paper information section.--[[User:Azemanci|Azemanci]] 18:42, 23 November 2010 (UTC)

DataCollider:
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors. The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed. This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

feel free to add/edit anything [[User:Nshires|Nshires]] 02:54, 2 December 2010 (UTC)

Right on thanks for that I was just about to start writing a section on data collider I'm not really sure what else we can critique.--[[User:Azemanci|Azemanci]] 03:11, 2 December 2010 (UTC)

I added a few things to the what you wrote and I also moved it to the main page. --[[User:Azemanci|Azemanci]] 03:22, 2 December 2010 (UTC)

COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:21:48Z

Azemanci: /* Data Collider: */

=Paper=
'''Effective Data-Race Detection for the Kernel'''

Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf

Video: http://homeostasis.scs.carleton.ca/osdi/video/erickson.mp4

Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

A data race is a potentially catastrophic event which can be alarmingly common in modern concurrent systems. When one thread attempts to read or write on a memory location at the same time that another thread is writing on the same location, there exists a potential data race condition. If the race is not handled properly, it could have a wide range of negative consequences. In the best case, there might be data corruption rendering the affected files unreadable and useless; this may not be a major problem if there exist archived, non-corrupted versions of the data. In the worst case, a process (possibly even the operating system itself) may freak out and crash, unable to decide what to do about the unexpected input it receives.

Traditional data-race detection programs operate by running an isolated runtime and comparing it with the currently active runtime, to find situations that would have resulted in a data race if the runtimes were not isolated. DataCollider operates by temporarily setting up breakpoints at random memory access instances. If a certain memory access hits a breakpoint, DataCollider springs into action. The breakpoint causes the memory access instruction to be postponed, and so the instruction pretty much goes to sleep until DataCollider has finished its job. The job is like taking a before and after photograph of something; DataCollider records the data stored at the address the instruction was attempting to access, then allows the instruction to execute. Then DataCollider records the data again. If the before and after records do not match, then another thread has tampered with the data at the same time that this instruction was trying to read it; this is precisely the definition of a data race.

[Don't worry guys; that's not all I've got. I'm still working on it.]

--[[User:Abondio2|Austin Bondio]] 01:56, 2 December 2010 (UTC)

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

===Style===
This paper is well put together. It has a strong flow and there is nothing that seems out of place. The authors start with an introduction and then immediately identify key definitions that are used throughout the paper. In the second section which follows the introduction the authors identify the definition of a Data-Race as it relates to their paper. This is important since it is a key concept that is required to understand the entire paper. This definition is required because as the authors state there is no standard for exactly how to define a data-race.[1] In addition to important definitions any background information that is relevant to this paper is presented at the beginning. The key idea which the paper is based on in this case Data Collider and its implementation is explained. An evaluation and conclusion of Data Collider follow its description. The order of the sections makes sense and the author is not jumping around from one concept to another. The organization of the sections and information provided make the paper easy to follow and understand.

===Content===
=====Data Collider:=====
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors.[1] The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed.[1] This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

=References=
[1] Erickson, Musuvathi, Burchhardt, Olynyk, Effective Data-Race Detection for the Kernel, Microsoft Research, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf PDF]

COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:21:27Z

Azemanci: /* Data Collider: */

=Paper=
'''Effective Data-Race Detection for the Kernel'''

Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf

Video: http://homeostasis.scs.carleton.ca/osdi/video/erickson.mp4

Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

A data race is a potentially catastrophic event which can be alarmingly common in modern concurrent systems. When one thread attempts to read or write on a memory location at the same time that another thread is writing on the same location, there exists a potential data race condition. If the race is not handled properly, it could have a wide range of negative consequences. In the best case, there might be data corruption rendering the affected files unreadable and useless; this may not be a major problem if there exist archived, non-corrupted versions of the data. In the worst case, a process (possibly even the operating system itself) may freak out and crash, unable to decide what to do about the unexpected input it receives.

Traditional data-race detection programs operate by running an isolated runtime and comparing it with the currently active runtime, to find situations that would have resulted in a data race if the runtimes were not isolated. DataCollider operates by temporarily setting up breakpoints at random memory access instances. If a certain memory access hits a breakpoint, DataCollider springs into action. The breakpoint causes the memory access instruction to be postponed, and so the instruction pretty much goes to sleep until DataCollider has finished its job. The job is like taking a before and after photograph of something; DataCollider records the data stored at the address the instruction was attempting to access, then allows the instruction to execute. Then DataCollider records the data again. If the before and after records do not match, then another thread has tampered with the data at the same time that this instruction was trying to read it; this is precisely the definition of a data race.

[Don't worry guys; that's not all I've got. I'm still working on it.]

--[[User:Abondio2|Austin Bondio]] 01:56, 2 December 2010 (UTC)

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

===Style===
This paper is well put together. It has a strong flow and there is nothing that seems out of place. The authors start with an introduction and then immediately identify key definitions that are used throughout the paper. In the second section which follows the introduction the authors identify the definition of a Data-Race as it relates to their paper. This is important since it is a key concept that is required to understand the entire paper. This definition is required because as the authors state there is no standard for exactly how to define a data-race.[1] In addition to important definitions any background information that is relevant to this paper is presented at the beginning. The key idea which the paper is based on in this case Data Collider and its implementation is explained. An evaluation and conclusion of Data Collider follow its description. The order of the sections makes sense and the author is not jumping around from one concept to another. The organization of the sections and information provided make the paper easy to follow and understand.

===Content===
=====Data Collider:=====
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors. The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this. Even though DataCollider returns 90% false alarms the projects team have still been able to locate 25 errors in the Windows operating system. Of those 25 errors 12 have already been fixed. This shows that DataCollider is an effective tool in locating data race errors within the kernel effectively enough that they can be corrected.

=References=
[1] Erickson, Musuvathi, Burchhardt, Olynyk, Effective Data-Race Detection for the Kernel, Microsoft Research, 2010.[http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf PDF]

Talk:COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:21:07Z

Azemanci: /* Critique */

Talk:COMP 3000 Essay 2 2010 Question 6

2010-12-02T03:11:18Z

Azemanci: /* Critique */

'''Actual group members'''

- Nicholas Shires nshires@connect.carleton.ca

- Andrew Zemancik andy.zemancik@gmail.com

- [[user:abondio2|Austin Bondio]] -> abondio2@connect.carleton.ca

- David Krutsko :: dkrutsko at connect.carleton.ca

If everyone could just post there names and contact information.--[[User:Azemanci|Azemanci]] 02:57, 15 November 2010 (UTC)

IMPORTANT 
THINGS WE NEED TO DEFINE: 
* Happens-before reasoning
* Lock-set based reasoning
* Hardware Breakpoints 
The prof seemed to be very focused on hardware breakpoints, so it is very important to define it well, and talk about it often, it looks like hardware breakpoints are the one thing thats setting DataCollider apart from other race detectors, so lets focus on it! 
IMPORTANT 

'''Who's Doing What'''

=Research Problem=
I'll do 'Research Problem' and help out with the 'Critique' section, the professor said that part was pretty big [[User:Nshires|Nshires]] 20:45, 21 November 2010 (UTC)

The research problem being addressed by this paper is the detection of erroneous data races inside the kernel without creating much overhead. This problem occurs because read/write access instructions in processes are not always atomic (e.g two read/write commands may happen simultaneously). There are so many ways a data race error may occur that it is very hard to catch them all.

The research team’s program DataCollider needs to detect errors between the hardware and kernel as well as errors in context thread synchronization in the kernel which must synchronize between user-mode processes, interrupts and deferred procedure calls. As shown in the Background Concepts section, this error can create unwanted problems in kernel modules. The research group created DataCollider which puts breakpoints in memory accesses to check if two system calls are calling the same piece of memory. There have been attempts at a solution to this problem in the past that ran in user-mode, but not in kernel mode, and they produced excessive overhead. There are many problems with trying to apply these techniques to a kernel.

One technique that some detectors in the past have used is the “happens before” method. This checks whether one access happened before another or if the other happened first, and if neither of those options were the case, the two accesses were done simultaneously. This method gathers true data race errors but is very hard to implement.

Another method used is the “lock-set” approach. This method checks all of the locks that are held currently by a thread, and if all the accesses do not have at least one common lock, the method sends a warning. This method has many false alarms since many variables nowadays are shared using other ways than locks or have very complex locking systems that lockset cannot understand.

Both these methods produce excessive overhead due to the fact that they have to check every single memory call at runtime. In the next section we will discuss how DataCollider uses a new way to check for data race errors, that produces barely any overhead.
http://www.hpcaconf.org/hpca13/papers/014-zhou.pdf

Moved from main page: (p.s thanks for the info!)[[User:Nshires|Nshires]] 02:32, 2 December 2010 (UTC)

Just a few rough notes:
Research problem / challenges for traditional detectors:

- data-race detectors run in user mode, whereas operating systems run kernel mode (supervisor mode).

- There are a lot of different synchronization methods, and a lot of ways to implement them. So it's nearly impossible to try and code a program that can catch all of them.

- Some kernel modules can "speak privately" with hardware components, so you can't make a program that just logs all the kernel's interactions.

- traditional data race detectors incur massive time overheads because they have to keep an eye on every single memory transaction that occurs at runtime.

--[[User:Abondio2|Austin Bondio]] 01:57, 2 December 2010 (UTC)

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

Ill do Contribution: [[User:Achamney|Achamney]] 03:50, 22 November 2010 (UTC) 

Proving that DataCollider is better: 
The key part of the contribution of this essay is its competition. The research team for DataCollider looked at several other implementations of race condition testers to find ways of improving their own program, or to look for different ways of solving the same problem. 

Some of the programs that were referenced were: 

* Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
* RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking 
* PACER: Proportional Detection of Data Races 
* LiteRace: Effective Sampling for Lightweight Data-Race Detection 
* FastTrack: Efficient and Precise Dynamic Race Detection 
* MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
* RacerX: Effective, Static Detection of Race Conditions and Deadlocks 

Eraser: A Dynamic Data Race Detector for Multithreaded Programs 
lock-set based reasoning 

Eraser, a data race detector programmed in 1997, was one of the earlier data race detectors on the market. It may have been a useful and revolutionary program of its time, however, it uses very low level techniques compared to most data race detectors today. One of the reason why it is unsuccessful is because it only checks whether memory accesses use proper locking techniques. If a memory access is found that does not use a lock, then Eraser will report a data race. In many cases, the misuse of proper locking techniques is a conscious decision by the programmer, so Eraser will report many false positives. This also does not take into account all of the benign problems such as date of access variables. DataCollider used this source as an example of a lock-set based program, and why they are a poor choice for a race condition debugger. 

PACER: Proportional Detection of Data Races 
happens-before reasoning 
Pacer, a happens-before reasoning data race detector, uses the FastTrack algorithm to detect data races. FastTrack uses vector-clocks to keep track of two threads, and find whether or not they are conflicting in any way. Pacer samples some percentage of each memory access, (from 1 to 3 percent) and runs the FastTrack happens-before algorithm on each thread that accesses that part of memory. DataCollider used this source as an example of the implementation of sampling. Similar to Pacer, DataCollider samples some memory accesses, but instead of using vector-clocks to catch the second thread, they use hardware break points. Hardware break points are considerably faster, and cause DataCollider to run much faster than Pacer. 

LiteRace: Effective Sampling for Lightweight Data-Race Detection 
happens-before reasoning 
LiteRace, similar to Pacer, samples a percentage of memory accesses from a program. Where it differs is the parts of memory that LiteRace samples the most. The "hot spot" regions of memory are ones that are accessed most by the program. Since they are accessed the most, chances are that they have already been successfully debugged, or if there are data races there, they are benign. LiteRace detects these areas in memory as hot spots, and samples them at a much lower rate. This improves LiteRace's chances of capturing a valid data race at a much lower sampling rate. Where DataCollider bests LiteRace is based on LiteRace's installing mechanism. LiteRace needs to be recompiled into the software it is trying to debug, whereas DataColleder's breakpoints do not require any code changes to the program. This is a major success for DataCollider because often third party testers do not have the source code for a program. 

RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Trackings 
combo of lock-set and happens-before reasoning 
HIGH OVERHEAD[http://www.cs.ucla.edu/~dlmarino/pubs/pldi09.pdf]

MultiRace: Efficient on-the-fly data race detection in multithreaded C++ programs 
combo of lock-set and happens-before reasoning 

I've noticed a couple things for controversy, even though its not my topic
The biggest thing i saw was that dataCollider reports non-erroneous operations 90% of the time. This makes the user have to sift through all of the reports to separate the problems from the benign races. [[User:Achamney|Achamney]] 17:18, 22 November 2010 (UTC) 

=Background Concepts=
Hey guys, sorry I'm late to the party. I'll get started with Background Concepts. - [[user:abondio2|Austin Bondio]] 15:33, 23 November 2010 (UTC)

=Critique=

I'll work on the critique which will probably need more then one person and I'll also fill out the paper information section.--[[User:Azemanci|Azemanci]] 18:42, 23 November 2010 (UTC)

DataCollider:
DataCollider seems like a very innovative piece of software. It’s new use of breakpoints inside kernel-space instead of lock-set or happens-before methods in user-mode let it check data race errors in the very kernel itself without producing as much overhead as its old contenders (it even finds data races for overheads less than five percent). One thing to note about DataCollider is that ninety percent of its output to the user is false alarms. This means that after running DataCollider, the user has to sift through all of the gathered data to find the ten percent of data that actually contains real data race errors. The team of creator’s were able to create a to sort through all of the material it collects to only spit out the valuable information, but the creators still found some false alarms in the output . They have noted though that some users like to see the benign reports so that they can make design changes to their programs to make them more portable and scalable and therefore decided not to implement this.

feel free to add/edit anything [[User:Nshires|Nshires]] 02:54, 2 December 2010 (UTC)

Right on thanks for that I was just about to start writing a section on data collider I'm not really sure what else we can critique.--[[User:Azemanci|Azemanci]] 03:11, 2 December 2010 (UTC)

COMP 3000 Essay 2 2010 Question 6

2010-11-25T03:00:11Z

Azemanci: /* Style */

COMP 3000 Essay 2 2010 Question 6

2010-11-25T02:59:47Z

Azemanci: /* Style */

COMP 3000 Essay 2 2010 Question 6

2010-11-25T02:46:08Z

Azemanci: /* Style */

COMP 3000 Essay 2 2010 Question 6

2010-11-25T02:45:55Z

Azemanci: /* Style */

COMP 3000 Essay 2 2010 Question 6

2010-11-25T00:05:09Z

Azemanci: /* Content */

COMP 3000 Essay 2 2010 Question 6

2010-11-25T00:03:11Z

Azemanci: /* Content */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:40:56Z

Azemanci: /* Critique */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:39:55Z

Azemanci: /* References */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:20:44Z

Azemanci: /* References */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:20:29Z

Azemanci: /* References */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:16:17Z

Azemanci: /* References */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:15:32Z

Azemanci: /* Paper */

COMP 3000 Essay 2 2010 Question 6

2010-11-24T21:13:58Z

Azemanci: /* Paper */

=Paper=
'''Effective Data-Race Detection for the Kernel'''

Paper: http://www.usenix.org/events/osdi10/tech/full_papers/Erickson.pdf

Video:

Authors: John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, Kirk Olynyk from Microsoft Research

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

=References=
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.

Talk:COMP 3000 Essay 2 2010 Question 6

2010-11-23T18:42:25Z

Azemanci:

COMP 3000 Essay 2 2010 Question 6

2010-11-21T05:20:27Z

Azemanci:

=Paper=
The paper's title, authors, and their affiliations. Include a link to the paper and any particularly helpful supplementary information.

=Background Concepts=
Explain briefly the background concepts and ideas that your fellow classmates will need to know first in order to understand your assigned paper.

=Research problem=
What is the research problem being addressed by the paper? How does this problem relate to past related work?

=Contribution=
What are the research contribution(s) of this work? Specifically, what are the key research results, and what do they mean? (What was implemented? Why is it any better than what came before?)

=Critique=
What is good and not-so-good about this paper? You may discuss both the style and content; be sure to ground your discussion with specific references. Simple assertions that something is good or bad is not enough - you must explain why.

=References=
You will almost certainly have to refer to other resources; please cite these resources in the style of citation of the papers assigned (inlined numbered references). Place your bibliographic entries in this section.

Talk:COMP 3000 Essay 2 2010 Question 6

2010-11-17T22:06:15Z

Azemanci:

Talk:COMP 3000 Essay 2 2010 Question 6

2010-11-15T02:57:37Z

Azemanci:

'''Actual group members'''

- Nicholas Shires

- Andrew Zemancik andy.zemancik@gmail.com

If everyone could just post there names and contact information.--[[User:Azemanci|Azemanci]] 02:57, 15 November 2010 (UTC)

COMP 3000 Essay 2 2010 Question 6

2010-11-15T02:55:44Z

Azemanci: Created page with "See Discussion"

See Discussion

Talk:COMP 3000 Essay 2 2010 Question 6

2010-11-15T02:54:13Z

Azemanci:

'''Actual group members'''

- Nicholas Shires

- Andrew Zemancik

Talk:COMP 3000 Essay 2 2010 Question 6

2010-11-15T02:53:55Z

Azemanci:

'''Actual group members'''

- Nicholas Shires
- Andrew Zemancik

COMP 3000 Essay 1 2010 Question 9

2010-10-15T05:25:03Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.

ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels of abstraction increase ZFS' flexibility and simplifies its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU
objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. We will tackle the main strategies followed by ZFS.

These can be summed as : checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency thus is assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS' terminology, is called the uberblock [Z1. P8].
Each block has a checksum which is maintained by the block parent's indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.

If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is able then to retrieve a backup from another
location and correct ( heal ) the related block. [Z1. P7-9].

Lastly, in ZFS's arsenal of self-healing, comes copy-on-write.

The DMU uses copy-on-write for all blocks. Whenever a block needs to be modified, a new block is created, and the old block is then copied to the new block. Any pointers and / or indirect blocks are then modified
traversing all the way up to the uberblock. [Z1]

The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2. These file systems were designed for users who had less and smaller storage devices than of today.The average worker/user would not have many files stored on their hard drive, and because of small amounts of data that might not be accessed as often as thought, the file systems did not worry too much about the procedures to repair data integrity(repairing the file system and relocating files).
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. ext3 had a very large overhead when dealing with larger files. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume 256TB and ext4 allows for 1EB. ZFS allow for a maximum file system of 16EB which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self heal which neither of the two current file systems. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

Btrfs, the B-tree File System was started by Oracle systems in 2007 , is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different.

Btrfs is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

Btrfs upon first inspection seems near identical to ZFS currently however Btrfs lacks a couple features the ZFS does have. Btrfs doesn't have the self-heading capability or data deduplication of ZFS. ZFS also support more configurations of software RAID than Btrfs.[http://en.wikipedia.org/wiki/Btrfs]

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., VIVIER, L., AND S.A.S., B. 2007. The
new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).

* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies

* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle

COMP 3000 Essay 1 2010 Question 9

2010-10-15T04:15:37Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.

ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels of abstraction increase ZFS' flexibility and simplifies its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU
objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. We will tackle the main strategies followed by ZFS.

These can be summed as : checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency thus is assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS' terminology, is called the uberblock [Z1. P8].
Each block has a checksum which is maintained by the block parent's indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.

If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is able then to retrieve a backup from another
location and correct ( heal ) the related block. [Z1. P7-9].

Lastly, in ZFS's arsenal of self-healing, comes copy-on-write.

TO-DO
copy-on-write
Almost finished --Tawfic

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2. These file systems were designed for users who had less and smaller storage devices than of today.The average worker/user would not have many files stored on their hard drive, and because of small amounts of data that might not be accessed as often as thought, the file systems did not worry too much about the procedures to repair data integrity(repairing the file system and relocating files).
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. ext3 had a very large overhead when dealing with larger files. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4] The primary goal of ext4 was to increase the amount of storage possible.

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume 256TB and ext4 allows for 1EB. ZFS allow for a maximum file system of 16EB which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self heal which neither of the two current file systems. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-15T04:15:02Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems in order to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems.

In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller.

ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels of abstraction increase ZFS' flexibility and simplifies its management. [Z3 P2]. Lastly, with respect to flexibility at least, the ZFS POSIX layer provides a POSIX compliant layer to manage DMU
objects. This allows any software developed for a POSIX compliant file system to work seamlessly with ZFS.

ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. We will tackle the main strategies followed by ZFS.

These can be summed as : checksumming, copy-on-write, and the use of transactions.

Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency thus is assured since updates are done atomically.

To self-heal a corrupted block, ZFS uses checksumming. It helps here to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS' terminology, is called the uberblock [Z1. P8].
Each block has a checksum which is maintained by the block parent's indirect block. The scheme of maintaining the checksum and the data separately reduces the probability of simultaneous corruption.

If a write fails for whatever reason, the uberblock is able to detect the failure since it has access to the checksum of the corrupted block. The uberblock is able then to retrieve a backup from another
location and correct ( heal ) the related block. [Z1. P7-9].

Lastly, in ZFS's arsenal of self-healing, comes copy-on-write.

TO-DO
copy-on-write
Almost finished --Tawfic

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2. These file systems were designed for users who had less and smaller storage devices than of today.The average worker/user would not have many files stored on their hard drive, and because of small amounts of data that might not be accessed as often as thought, the file systems did not worry too much about the procedures to repair data integrity(repairing the file system and relocating files).
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. ext3 had a very large overhead when dealing with larger files. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4]

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume 256TB and ext4 allows for 1EB. ZFS allow for a maximum file system of 16EB which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self heal which neither of the two current file systems. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T21:52:11Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2. These file systems were designed for users who had less and smaller storage devices than of today.The average worker/user would not have many files stored on their hard drive, and because of small amounts of data that might not be accessed as often as thought, the file systems did not worry too much about the procedures to repair data integrity(repairing the file system and relocating files).
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4]

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume 256TB and ext4 allows for 1EB. ZFS allow for a maximum file system of 16EB which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self heal which neither of the two current file systems. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

*[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

*[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

*[3] http://support.microsoft.com/kb/251186

*[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T20:15:51Z

Azemanci: /* Current File Systems */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4]

====Comparison====
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume 256TB and ext4 allows for 1EB. ZFS allow for a maximum file system of 16EB which is 16 times more than the current ext4 Linux file system. After viewing the amount of storage available to the current file systems it is clear that ZFS is better suited to servers. ZFS has the ability to self heal which neither of the two current file systems. This improves performance as there is no need for down time to scan the disk to check for errors.

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

Talk:COMP 3000 Essay 1 2010 Question 9

2010-10-14T20:05:45Z

Azemanci: /* Sources */

== Contacts / If interested ==
Tawfic : tfatah@gmail.com

Andy Zemancik: andy.zemancik@gmail.com

Lester Mundt: lmundt@gmail.com

Matthew Chou : mateh.cc@gmail.com (this is mchou2)

Nisrin Abou-Seido: naseido@connect.carleton.ca

== Suggested References Format ==
Author, publisher/university, Name of the article

== Who is doing what ==
Suggestion: In order to avoid duplication. Please state what section/item you're currently working on.

Tawfic : Currently working on Section One ZFS.

Azemanci: Currently working on Section Three Current File Systems.

== Deadline ==
Suggestion: Adding content should stop on Thursday, October 14'th at 3:00 PM. Any work after that
should go into formatting, spelling, and grammar checking.

--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)
- I will definitely be adding content after this time probably late, late into the evening.

--[[User:Tafatah|Tafatah]] 19:25, 14 October 2010 (UTC) No problem. Forget about the suggested deadline. I thought we'd have to be done by 11:00Pm.
I am still adding stuff myself. I think Anil will lock the Wiki around 7:00 Am or so. So anytime
before that is Ok.

== Essay Format Take 2 ==
Hello. I am suggesting the following format instead. If you agree, I'll take care of merging the existing info into this new format. My feeling is that this format is
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.

* '''Abstract'''
TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly
growing needs of dynamic storage.

TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer
demand . . etc )

TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more
intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.

* '''Traditional File Systems'''
** '''Characteristics'''
** '''Limitations'''

* '''Zettabyte File System'''
** '''Characteristics'''
** '''Dissected'''
TO-DO: List the seven components of ZFS and basically what makes a ZFS
E.g. interface, various parts, and external needed libraries . . etc.

** '''Features Beyond Traditional File Systems'''

** '''Possible Real-Life Scenarios / Examples'''
TO-DO: 2-3 examples where ZFS was/could/is being considered for use.

TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract
thesis statement).

* '''Alternatives to ZFS'''
one example is good enough.
TO-DO: a brief description of the alternative.
Main argument for it’s viability.

** '''Pros/Cons'''
TO-DO: just a list of pluses and minuses

TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional
FS’s are no longer viable, and stressing once more that ZFS is a valid alternative.

== Essay Format ==

I started working on the main page. The bullets are to be expanded. Other group are are working in their respective discussion pages but I think it's all right to put our work in progress on the front page. Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt; what do you think of listing the capacities of the file system under major features? I was thinking that we could overview the features in brief, then delve into each one individually.
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what your suggesting in one section. So here is the structure I am thinking of.

* Intro
* Section One ZFS
** Major feature 1
** Major feature 2
** Major feature 3
* Section Two Legacy File Systems
** Legacy File System1( FAT32 ) - what it does
** Legacy File System2( ext2 ) - what it does
** Contrast them with ZFS
* Section Three Current File Systems
** NTFS?
** ext4?
** Contrast them with ZFS
* Section Four future file Systems
** BTRFS
** WinFS or ??
** Contrast them with ZFS
* Conclusion

What does everyone think of this format? While everyone should contribute to section one we could divvy up the rest.

[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming it's own essay just on deduplication.

The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.

The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than an full end-to-end solution.

--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC) The section on deduplication looks good I agree it looks like the right amount of coverage for a portion of a single section. Your also right about the old file systems not being able to hold a candle to ZFS and the conclusion section should talk about how storage needs and computers changed. And intro to that section could set the stage for that period as well. Non-multi-threaded, single processor system with much smaller RAM, even the applications were radically different the Internet was just single webpages without the high performance needs of web commerce and online banking for example. I have another assignment so won't be contributing too much until Monday.

--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)

--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)
I think I prefer the outline I proposed only because it's a very regimented contrast/compare essay format and should get us any marks for format. Most proper essays don't usually have a dedicated pros cons list. Heading more towards a report format I think. It's really what everyone agrees on. I won't be touching the essay until tomorrow though.

--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)
I like Lmundt's outline. How would you like to divide up the work? Also can everyone post the contact information so we know exactly who is in our group.

--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)
No problem, I'll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that
if we keep these points in mind. I started looking into the "dissect subsection" in the format I suggested, which is related to the ZFS features
section one in the current format. I'll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I'll
take care of section one since I've already done some work on it. I suggest that each member of the group picks two items from one of the other
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the
different pieces.

--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)
I will begin working on section three current File Systems unless someone else has already begun working on it.

--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)
I am going to start researching for section 2.

--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC) Alright so all the sections are being taken care of so we should be good to go for Thursday.

--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) '''No one is assigned to section four''' ? Also, for those who haven't picked any section or subsection, please help out with the sections you're
more familiar with.

Finally, if you were in class today (well, technically yesterday), then you've heard Anil talk about plagiarism. I know this is common knowledge, so forgive
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is
simply impossible for the rest of the group to check whether every member's sentence is genuine or not. So use your own words/phrases ( doesn't
have to be fancy or sophisticated ). If you're not sure, please check with the rest of the group.

Good luck, and good night.
--Tawfic

--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC) My bad I misread something I thought you were doing current file systems section 3. I'll take section 3 but then someone needs to do section 4. There are 4 of us so this should not be a problem.

--[[User:Naseido|Naseido]] 13 October 2010 Sorry I haven't contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I'll be doing section 4.

--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.

--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)
I have done what I can for the legacy file systems, if someone who doesn't have any particular job wouldn't mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.

--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing

--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC) I'm not going to have my info posted by 3:00. Also how and where are we supposed to cite our sources?

--[[User:Tafatah|Tafatah]] 19:28, 14 October 2010 (UTC) No worries. You have till 7:00 Am ( or till Anil locks the Wiki down, though I wouldn't count on more than 7) Friday, Oct 15. For citing, I am using
this convention. Bla.....Bla [Z1. P3] means I am using info from page 3 of article labeled as Z1 in references section.

== Sources ==

Not from your group. Found a file which goes to the heart of your problem
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-14990
2.pdf ZFSDatasheet]
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)

Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)

[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:
* PDF: http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&hl=en&as_sdt=2000

Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)

Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)

the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)

Hey, I'm not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires

------

Hey there. I'm not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations

-- Munther

--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf

--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)
Abit late but I found a comparison of current file systems including ZFS:
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf

http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:58:52Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [4]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[4]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:58:18Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [5] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [5]. Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[5]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:51:02Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [5] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [5] Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increases the maximum volume size to 1EB up from the 16TB max of ext3.[5]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:50:37Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [5] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [5] Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increase the maximum volume size to 1EB up from the 16TB max of ext3.[5]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:50:27Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [5] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [5] Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increase the maximum volume size to 1EB up from the 16TB max of ext3.[5]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[5] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:50:04Z

Azemanci: /* ext4 */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====
Fourth Extended File System also known as ext4 is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to all for greater scalability then ext3. Ext4 uses extents which is a descriptor representing contiguous physical blocks. [5] Extents represent the data that is stored in the volume. They allow for better performance when handling large files when compared with ext3. Ext4 file system is also a journaling file system. It records changes to be made in a journal then makes the changes in case there is a chance of interruption while writing to the disk. In order to ensure data integrity ext4 utilizes check summing. In ext4 a checksum has been implemented in the journal do to the high importance of the data stored there. [5] Ext4 does not support compression so there are no slowdowns when moving data as there is no compressing and decompressing. Ext4 uses a 48-bit physical block rather than the 32-bit block used by ext3. The 48-bit block increase the maximum volume size to 1EB up from the 16TB max of ext3.[5]

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:38:02Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object
in ZFS is labelled with a 64 bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about blocks per object. In other words, it provides a lower level abstraction so
that a collection of one ore more blocks can be treated as an object. A collection of objects thus, referred to as object sets, is used to describe the file system.
In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, and in turn, a collection of blocks. Such levels
of abstraction increase ZFS' flexibility and simplifies the management of a file system. [Z3 P2].

TO-DO:
ZPL and the common interface
TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:09:24Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1]http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:09:13Z

Azemanci: /* References */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1]http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx
[2] http://technet.microsoft.com/en-us/library/cc938919.aspx
[3] http://support.microsoft.com/kb/251186

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:07:05Z

Azemanci: /* NTFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[1]

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

COMP 3000 Essay 1 2010 Question 9

2010-10-14T19:05:39Z

Azemanci: /* NTFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table.[1] Having the copy of the MFT ensures that if there is an error with the primary MTF the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. NTFS is journal file system which means it utilizes a journal to ensure data integrity. The file system enters changes into the journal before they are made in case there is an interruption while those changes are being written to disk. A specific feature to NTFS is a change journal which records all changes that are made to the file system this assist’s in performance as the entire volume does not have to be scanned to find changes. [2] NTFS also allows for compression of files to save disk space unfortunately it can affect performance. Performance is affected because in order to move compressed files they must first be decompressed then transferred and recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.[4]

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

COMP 3000 Essay 1 2010 Question 9

2010-10-14T18:52:37Z

Azemanci: /* NTFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
ZFS was developed by Sun Microsystems to handle the functionality required by a file system, as well as a volume manager. Some of the motivations behind the development of ZFS were modularity and simplicity,
immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, the designers were keen to avoid some of the pitfalls of traditional file systems. Some of these problems are possible data corruption,
especially silent corruption, inability to expand and shrink storage dynamically, inability to fix bad blocks automatically, as well as a less than desired level of abstractions and simple interfaces. [Z2]

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system,
i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence,
the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. In other words, physical blocks are abstracted as logical ones. Yet, a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, even when not used, can not be shared with other file systems. In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of API's to allocate and free blocks of storage, using the blocks' DVA's (data virtual addresses). It behaves like malloc() and free(). Instead of memory though, physical storage is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVA's to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added and/or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.

Virtual devices (vdevs) abstract virtual device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device ( i.e. a vdev ) or a device driver. The SPA also handles the
traditional volume manager tasks like mirroring for example. It accomplishes such tasks via the use of vdevs. Each vdev implements a specific task. In this case, if SPA needed to handle mirroring, a vdev
would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

TO-DO:
How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write
checksumming
Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a storage device, and the ext2 file system also just manages a storage device, where as the ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. By being able to manage multiple storage devices under one file system means that resources are now available throughout the system, and that nothing is unavailable when accessing the data from the ZFS.

== '''Current File Systems''' ==

====NTFS====

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

Talk:COMP 3000 Essay 1 2010 Question 9

2010-10-14T18:49:45Z

Azemanci: /* Sources */

== Contacts / If interested ==
Tawfic : tfatah@gmail.com

Andy Zemancik: andy.zemancik@gmail.com

Lester Mundt: lmundt@gmail.com

Matthew Chou : mateh.cc@gmail.com (this is mchou2)

Nisrin Abou-Seido: naseido@connect.carleton.ca

== Suggested References Format ==
Author, publisher/university, Name of the article

== Who is doing what ==
Suggestion: In order to avoid duplication. Please state what section/item you're currently working on.

Tawfic : Currently working on Section One ZFS.

Azemanci: Currently working on Section Three Current File Systems.

== Deadline ==
Suggestion: Adding content should stop on Thursday, October 14'th at 3:00 PM. Any work after that
should go into formatting, spelling, and grammar checking.

--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)
- I will definitely be adding content after this time probably late, late into the evening.

== Essay Format Take 2 ==
Hello. I am suggesting the following format instead. If you agree, I'll take care of merging the existing info into this new format. My feeling is that this format is
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.

* '''Abstract'''
TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly
growing needs of dynamic storage.

TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer
demand . . etc )

TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more
intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.

* '''Traditional File Systems'''
** '''Characteristics'''
** '''Limitations'''

* '''Zettabyte File System'''
** '''Characteristics'''
** '''Dissected'''
TO-DO: List the seven components of ZFS and basically what makes a ZFS
E.g. interface, various parts, and external needed libraries . . etc.

** '''Features Beyond Traditional File Systems'''

** '''Possible Real-Life Scenarios / Examples'''
TO-DO: 2-3 examples where ZFS was/could/is being considered for use.

TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract
thesis statement).

* '''Alternatives to ZFS'''
one example is good enough.
TO-DO: a brief description of the alternative.
Main argument for it’s viability.

** '''Pros/Cons'''
TO-DO: just a list of pluses and minuses

TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional
FS’s are no longer viable, and stressing once more that ZFS is a valid alternative.

== Essay Format ==

I started working on the main page. The bullets are to be expanded. Other group are are working in their respective discussion pages but I think it's all right to put our work in progress on the front page. Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt; what do you think of listing the capacities of the file system under major features? I was thinking that we could overview the features in brief, then delve into each one individually.
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what your suggesting in one section. So here is the structure I am thinking of.

* Intro
* Section One ZFS
** Major feature 1
** Major feature 2
** Major feature 3
* Section Two Legacy File Systems
** Legacy File System1( FAT32 ) - what it does
** Legacy File System2( ext2 ) - what it does
** Contrast them with ZFS
* Section Three Current File Systems
** NTFS?
** ext4?
** Contrast them with ZFS
* Section Four future file Systems
** BTRFS
** WinFS or ??
** Contrast them with ZFS
* Conclusion

What does everyone think of this format? While everyone should contribute to section one we could divvy up the rest.

[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming it's own essay just on deduplication.

The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.

The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than an full end-to-end solution.

--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC) The section on deduplication looks good I agree it looks like the right amount of coverage for a portion of a single section. Your also right about the old file systems not being able to hold a candle to ZFS and the conclusion section should talk about how storage needs and computers changed. And intro to that section could set the stage for that period as well. Non-multi-threaded, single processor system with much smaller RAM, even the applications were radically different the Internet was just single webpages without the high performance needs of web commerce and online banking for example. I have another assignment so won't be contributing too much until Monday.

--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)

--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)
I think I prefer the outline I proposed only because it's a very regimented contrast/compare essay format and should get us any marks for format. Most proper essays don't usually have a dedicated pros cons list. Heading more towards a report format I think. It's really what everyone agrees on. I won't be touching the essay until tomorrow though.

--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)
I like Lmundt's outline. How would you like to divide up the work? Also can everyone post the contact information so we know exactly who is in our group.

--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)
No problem, I'll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that
if we keep these points in mind. I started looking into the "dissect subsection" in the format I suggested, which is related to the ZFS features
section one in the current format. I'll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I'll
take care of section one since I've already done some work on it. I suggest that each member of the group picks two items from one of the other
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the
different pieces.

--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)
I will begin working on section three current File Systems unless someone else has already begun working on it.

--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)
I am going to start researching for section 2.

--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC) Alright so all the sections are being taken care of so we should be good to go for Thursday.

--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) '''No one is assigned to section four''' ? Also, for those who haven't picked any section or subsection, please help out with the sections you're
more familiar with.

Finally, if you were in class today (well, technically yesterday), then you've heard Anil talk about plagiarism. I know this is common knowledge, so forgive
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is
simply impossible for the rest of the group to check whether every member's sentence is genuine or not. So use your own words/phrases ( doesn't
have to be fancy or sophisticated ). If you're not sure, please check with the rest of the group.

Good luck, and good night.
--Tawfic

--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC) My bad I misread something I thought you were doing current file systems section 3. I'll take section 3 but then someone needs to do section 4. There are 4 of us so this should not be a problem.

--[[User:Naseido|Naseido]] 13 October 2010 Sorry I haven't contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I'll be doing section 4.

--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.

--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)
I have done what I can for the legacy file systems, if someone who doesn't have any particular job wouldn't mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.

--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing

--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC) I'm not going to have my info posted by 3:00. Also how and where are we supposed to cite our sources?

== Sources ==

Not from your group. Found a file which goes to the heart of your problem
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-14990
2.pdf ZFSDatasheet]
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)

Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)

[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:
* PDF: http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&hl=en&as_sdt=2000

Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)

Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)

the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)

Hey, I'm not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires

------

Hey there. I'm not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations

-- Munther

--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf

--[[User:Azemanci|Azemanci]] 18:49, 14 October 2010 (UTC)
Abit late but I found a comparison of current file systems including ZFS:
http://www.idt.mdh.se/kurser/ct3340/ht09/ADMINISTRATION/IRCSE09-submissions/ircse09_submission_16.pdf

Talk:COMP 3000 Essay 1 2010 Question 9

2010-10-14T18:42:13Z

Azemanci: /* Essay Format */

== Contacts / If interested ==
Tawfic : tfatah@gmail.com

Andy Zemancik: andy.zemancik@gmail.com

Lester Mundt: lmundt@gmail.com

Matthew Chou : mateh.cc@gmail.com (this is mchou2)

Nisrin Abou-Seido: naseido@connect.carleton.ca

== Suggested References Format ==
Author, publisher/university, Name of the article

== Who is doing what ==
Suggestion: In order to avoid duplication. Please state what section/item you're currently working on.

Tawfic : Currently working on Section One ZFS.

Azemanci: Currently working on Section Three Current File Systems.

== Deadline ==
Suggestion: Adding content should stop on Thursday, October 14'th at 3:00 PM. Any work after that
should go into formatting, spelling, and grammar checking.

--[[User:Lmundt|Lmundt]] 15:00, 14 October 2010 (UTC)
- I will definitely be adding content after this time probably late, late into the evening.

== Essay Format Take 2 ==
Hello. I am suggesting the following format instead. If you agree, I'll take care of merging the existing info into this new format. My feeling is that this format is
more flexible and will (hopefully) allow individuals to take a section or a sub-section and work on it.

* '''Abstract'''
TO-DO: Main point. Current File Systems are neither versatile enough nor intelligent to handle the rapidly
growing needs of dynamic storage.

TO-DO: few statements regarding the WHYS as to the need for versatile storage (e.g. cloud computing, mobile environments, shifting consumer
demand . . etc )

TO-DO: few statements regarding the need for intelligence (just statements, the body will take care of expanding on these ). E.g. more
intelligent FS’s can include Metadata to help crime investigators, smart FS’s could be self healing . . .etc.

* '''Traditional File Systems'''
** '''Characteristics'''
** '''Limitations'''

* '''Zettabyte File System'''
** '''Characteristics'''
** '''Dissected'''
TO-DO: List the seven components of ZFS and basically what makes a ZFS
E.g. interface, various parts, and external needed libraries . . etc.

** '''Features Beyond Traditional File Systems'''

** '''Possible Real-Life Scenarios / Examples'''
TO-DO: 2-3 examples where ZFS was/could/is being considered for use.

TO-DO : One to two paragraphs stressing / reiterating the main points made in the abstract
thesis statement).

* '''Alternatives to ZFS'''
one example is good enough.
TO-DO: a brief description of the alternative.
Main argument for it’s viability.

** '''Pros/Cons'''
TO-DO: just a list of pluses and minuses

TO-DO : two to three paragraphs summarizing (this is the conclusion) the main points outlined in the abstract and the body, restating why traditional
FS’s are no longer viable, and stressing once more that ZFS is a valid alternative.

== Essay Format ==

I started working on the main page. The bullets are to be expanded. Other group are are working in their respective discussion pages but I think it's all right to put our work in progress on the front page. Thoughts?--[[User:Lmundt|Lmundt]] 16:14, 6 October 2010 (UTC)
* [[User:Gbint|Gbint]] 02:03, 7 October 2010 (UTC) Lmundt; what do you think of listing the capacities of the file system under major features? I was thinking that we could overview the features in brief, then delve into each one individually.
* --[[User:Lmundt|Lmundt]] 14:31, 7 October 2010 (UTC) I was thinking about the major structure... I like what your suggesting in one section. So here is the structure I am thinking of.

* Intro
* Section One ZFS
** Major feature 1
** Major feature 2
** Major feature 3
* Section Two Legacy File Systems
** Legacy File System1( FAT32 ) - what it does
** Legacy File System2( ext2 ) - what it does
** Contrast them with ZFS
* Section Three Current File Systems
** NTFS?
** ext4?
** Contrast them with ZFS
* Section Four future file Systems
** BTRFS
** WinFS or ??
** Contrast them with ZFS
* Conclusion

What does everyone think of this format? While everyone should contribute to section one we could divvy up the rest.

[[User:Gbint|Gbint]] 16:29, 9 October 2010 (UTC) The layout looks good; I filled out the data dedup section. I think it has reasonable coverage while staying away from becoming it's own essay just on deduplication.

The legacy file systems are really not even in the same world as ZFS, so I think the contrasting section should cover a lot of how storage needs have changed.

The current file systems are capable of being expanded into large pools of storage with good redundancy and even advanced features like data deduplication, but they are only a component in a chain of tools (like ext4 + lvm + mdraid + opendedup) rather than an full end-to-end solution.

--[[User:Lmundt|Lmundt]] 23:35, 9 October 2010 (UTC) The section on deduplication looks good I agree it looks like the right amount of coverage for a portion of a single section. Your also right about the old file systems not being able to hold a candle to ZFS and the conclusion section should talk about how storage needs and computers changed. And intro to that section could set the stage for that period as well. Non-multi-threaded, single processor system with much smaller RAM, even the applications were radically different the Internet was just single webpages without the high performance needs of web commerce and online banking for example. I have another assignment so won't be contributing too much until Monday.

--[[User:Tafatah|Tafatah]] 23:54, 10 October 2010 (UTC)
Please take a look at suggested essay format #2 and let me know soon. Time is running out Gents and Ladies :)

--[[User:Lmundt|Lmundt]] 15:35, 11 October 2010 (UTC)
I think I prefer the outline I proposed only because it's a very regimented contrast/compare essay format and should get us any marks for format. Most proper essays don't usually have a dedicated pros cons list. Heading more towards a report format I think. It's really what everyone agrees on. I won't be touching the essay until tomorrow though.

--[[User:Azemanci|Azemanci]] 17:32, 11 October 2010 (UTC)
I like Lmundt's outline. How would you like to divide up the work? Also can everyone post the contact information so we know exactly who is in our group.

--[[User:Tafatah|Tafatah]] 19:03, 11 October 2010 (UTC)
No problem, I'll go with the current format. One issue to keep in mind is that this is an essay, not a report. I.E. the intro/thesis has to include
a reasonable suggestion towards using ZFS as a reliable FS. The body and the conclusion would have to assert that. The current format satisfies that
if we keep these points in mind. I started looking into the "dissect subsection" in the format I suggested, which is related to the ZFS features
section one in the current format. I'll continue to look into that part (above section, who is doing what will be updated accordingly), i.e. I'll
take care of section one since I've already done some work on it. I suggest that each member of the group picks two items from one of the other
sections, except the contrasting part. Content in section one can then be used to finalize the comparisons in each of sections 2-4. The Intro/Abstract
and conclusion sections can be left to the end, and can be done collaboratively. I.E. once we have a very clear picture of all the
different pieces.

--[[User:Azemanci|Azemanci]] 03:18, 12 October 2010 (UTC)
I will begin working on section three current File Systems unless someone else has already begun working on it.

--[[User:Mchou2|Mchou2]] 20:29, 12 October 2010 (UTC)
I am going to start researching for section 2.

--[[User:Azemanci|Azemanci]] 03:15, 13 October 2010 (UTC) Alright so all the sections are being taken care of so we should be good to go for Thursday.

--[[User:Tafatah|Tafatah]] 04:35, 13 October 2010 (UTC) '''No one is assigned to section four''' ? Also, for those who haven't picked any section or subsection, please help out with the sections you're
more familiar with.

Finally, if you were in class today (well, technically yesterday), then you've heard Anil talk about plagiarism. I know this is common knowledge, so forgive
the annoying reminder. Please never copy and paste, and make sure to cite your info. As Anil mentioned, if anyone plagiarises, we are ALL responsible. It is
simply impossible for the rest of the group to check whether every member's sentence is genuine or not. So use your own words/phrases ( doesn't
have to be fancy or sophisticated ). If you're not sure, please check with the rest of the group.

Good luck, and good night.
--Tawfic

--[[User:Azemanci|Azemanci]] 14:55, 13 October 2010 (UTC) My bad I misread something I thought you were doing current file systems section 3. I'll take section 3 but then someone needs to do section 4. There are 4 of us so this should not be a problem.

--[[User:Naseido|Naseido]] 13 October 2010 Sorry I haven't contributed till now. The outline looks great and I think we can spend most of the day tomorrow editing to make sure all the sections fit together like an essay. I'll be doing section 4.

--[[User:Tafatah|Tafatah]] 16:11, 13 October 2010 (UTC) Hi. In section 4 the most important one is BTRFS. More info on that and less info on the others is Ok.

--[[User:Mchou2|Mchou2]] 03:00, 14 October 2010 (UTC)
I have done what I can for the legacy file systems, if someone who doesn't have any particular job wouldn't mind going over it and correcting any errors they see. I am also not familiar with how to edit/format these wiki pages so I tried my best and if you want to change the layout then please do, I would assume after we complete our sections and collaborate them into 1 essay that the formatting will change. I simply put headings on each section just so it is easier to read.

--[[User:Tafatah|Tafatah]] 04:55, 14 October 2010 (UTC) A reference for wiki editing http://meta.wikimedia.org/wiki/Help:Editing

--[[User:Azemanci|Azemanci]] 18:42, 14 October 2010 (UTC) I'm not going to have my info posted by 3:00. Also how and where are we supposed to cite our sources?

== Sources ==

Not from your group. Found a file which goes to the heart of your problem
[http://www.oracle.com/technetwork/server-storage/solaris/overview/zfs-14990
2.pdf ZFSDatasheet]
[[User:Gautam|Gautam]] 22:50, 5 October 2010 (UTC)

Thanks will take a look at that.--[[User:Lmundt|Lmundt]] 16:12, 6 October 2010 (UTC)

[[User:Gbint|Gbint]] 01:45, 7 October 2010 (UTC) paper from Sun engineers explaining why they came to build ZFS, the problems they wanted to solve:
* PDF: http://www.timwort.org/classp/200_HTML/docs/zfs_wp.pdf
* HTML: http://74.125.155.132/scholar?q=cache:6Ex3KbFo4lYJ:scholar.google.com/+zettabyte+file+system&hl=en&as_sdt=2000

Excellent article.[[User:Lmundt|Lmundt]] 14:24, 7 October 2010 (UTC)

Not too exciting but it looks like an easy read http://arstechnica.com/hardware/news/2008/03/past-present-future-file-systems.ars [[User:Lmundt|Lmundt]] 14:40, 7 October 2010 (UTC)

the [http://en.wikipedia.org/wiki/Comparison_of_file_systems wikipedia comparison] has some good tables, and if you click the various categories you can learn quite a bit about the various important features //not your group. [[User:Rift|Rift]] 18:56, 7 October 2010 (UTC)

Hey, I'm not from your group but I found this slideshow that was really handy in the assignment! http://www.slideshare.net/Clogeny/zfs-the-last-word-in-filesystems - nshires

------

Hey there. I'm not a member of your group. But you guys might want to look at this Wiki-page from the SolarisInternals website. I used it today for our assignment, a lot of interesting and in-depth breakdown of the ZFS file system: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Performance_Considerations

-- Munther

--[[User:Mchou2|Mchou2]] 03:56, 13 October 2010 (UTC) Good intro to understanding FAT FS
http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf

COMP 3000 Essay 1 2010 Question 9

2010-10-14T06:51:56Z

Azemanci: /* NTFS */

=Question=

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

=Answer=

== '''Introduction''' ==
TO-DO: Edit, expand, revise

ZFS was developed by Sun Microsystems (now owned by Oracle) as a server class file systems. This differs from most file systems which were developed as desktop file systems that could be used by servers. With the server being the target for the file system particular attention was paid to data integrity, size and speed.

One of the most significant ways in which the ZFS differs from traditional file systems is the level of abstraction. While a traditional file system abstracts away the physical properties of the media upon which it lies i.e. hard disk, flash drive, CD-ROM, etc. ZFS abstracts away if the file system lives one or many different pieces of hardware or media. Examples include a single hard drive, an array of hardrives, a number of hard drives on non co-located systems.

One of the mechanisms that allows this abstraction is that the volume manager which is normally a program separate from the file system in traditional file systems is moved into ZFS.

ZFS is a 128-bit file system allowing this allows addressing of 2128 bytes of storage.

== '''ZFS''' ==
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization
of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems makeup ZFS [Z3. P2].
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, i.e. via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality, as a consequence, the entire system becomes simpler and easier to maintain.

TO-DO : Not finished yet --Tawfic

Advantages of pooled storage
# No partitions to manage.
# All free storage space is always available.
# Easy to grow/shrink.

Problems a ZFS attempts to tackle/avoid
Loosing important files
Running out of space on a partition
Booting with a damaged root file system.

Issues with existing File Systems
No way to prevent silent data corruptions
E.g. defects in a controller, disk, firmware . . etc can corrupt data silently.
Hard to manage
Limits on file sizes, number of files, files per directory..etc

In ZFS, the ideas of files and directories are replaced by objects.

ZFS is composed of seven components (achieving, and going beyond what
a TFS provides ):
# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

====Physical Layer Abstraction====

* volume management and file system all in one
* file systems on top of zpools on top of vdevs on top of physical devices
* file systems easily and often span over many physical devices.
* ridiculous capacity

====Data Integrity====

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

====Data Deduplication====

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash-tables, and can be applied to whole files, sub files (blocks), or as a patch set. There is an inherit trade off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks means more hash table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least over all storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to store the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (so, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method means higher overall disk I/O is needed, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

== '''Legacy File Systems''' ==
Files exist on memory sources such as hard disks and flash memory, and when saving these files onto memory sources there must be an abstraction that organizes how these files will be stored and later retrieved. The abstraction that is used is a file system, and one such file system is FAT32, and another is ext2.
====FAT32====
When files are stored onto storage devices, the storage device`s memory is made up of sectors (usually 512bytes) . Initially it was planned so that these sectors would contain the data of a file, and that some larger files would be stored as multiple sectors. In order for one to attempt to retrieve a file, each sector must have been stored and also documented on which sectors contained the data of the requested file. Since the size of each sector is relatively small in comparison to larger files that exist in the world, it would take significant amounts of time and memory to document each sector with the file it is associated with and where it is located. Because of the inconvenience of having so many sectors documented, the FAT file system has implemented clusters; which are a defined grouping of sectors. These clusters would serve as groupings of sectors and each cluster would be related to one file. An issue that has been discovered about using clusters is the event of storing a file that was smaller than a cluster, then the file would take up space in the cluster and no other file would be able to access the unused sectors in that cluster. For the FAT32 file system, the name FAT stands for File Allocation Table, which is the the table that contains entries of the clusters in the storage device and their properties. The FAT is designed as a linked list data structure which holds in each node a cluster’s information. “ For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.”#2.3a The digits specified beside each naming of a FAT system, as is in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32-bits, 28 of them are used to number the clusters in the storage device, therefore this means that 2^28 clusters are available. Issues that arise from having larger clusters is when files are drastically smaller than the cluster size, because then there is a lot of excess wasted space in the cluster. When clusters are being used to contain a file, when the file is accessed the file system must find all clusters that go together that make up the file, this process takes long if the clusters are not organized. When files are also deleted, the clusters are modified as well and leave empty clusters available for new data, because of this, some files may have their clusters scattered through the storage device and when accessing the file it would take longer to access. FAT32 does not include a defragmentation system, but all of the recent Windows OS’ come with a defragmentation tool for users to use. Defragging allows for the storage device to organize the fragments of a file (clusters) so that they reside near each other, which helps with the timing it takes to access a file from the file system. Since reorganization (defragging) is not a default function in the FAT32 system, when trying to store a file, looking for a empty space requires a linear search through all the clusters, this is one of the drawbacks to using FAT32, it is slow. The first cluster of every FAT32 file system contains information about the operating system, root directory, and always contains 2 copies of the file allocation table so that in the case of the file system being interrupted, a secondary FAT is available to be used to recover the files.
==== Ext2 ====
The ext2 file system (second extended file system) was designed after the UFS (Unix File System) and attempts to mimic certain functionalities of UFS yet remove unnecessary ones as well. Ext2 organizes the memory space into blocks, which are then seperated into block groups (similar to the cylinder groups in UFS) system. There is a superblock that is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group.The superblock also contains the total number of inodes and the number of inodes per block group.#2.3c Files in ext2 are represented by Inodes. Inodes are a structure that contain the description of the file, file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the files data. In FAT32 the file allocation table was used to define the organization of how file fragments were, and it was vital to have duplicate copies of this FAT just in case of crashes. Just as it was in FAT32 with having duplicate copies of the FAT in the first cluster, the first block in ext2 is the superblock and it also contains the list of group descriptors (each group block has a group descriptor to map out where files are in the group) Backup copies of the superblock and group descriptors exist through the system in case the primary source gets affected. (each group has a group descriptor). These backup copies are used when the system had an unclean shutdown and requires the use of the “fsck” (file system checker) which traverses through the inodes and directories to repair any inconsistencies.#2.3d
==== Comparison ====
When observing how storage devices are managed using different file systems, one can notice that FAT32 file system has a max volume of 2TB(8TB -32KB clusters, 16TB -64KB clusters), 32TB for the ext2, and the ZFS contains 2^58 ZB(Zettabyte), where each ZB is 2^70 bytes(quite larger). “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process”#2.3e , because of this, the use a fsck is not used in ZFS, where as it is in the ext2 filesystem. Not having to check for inconsistencies allows for the ext2 file system to save time and resources by not systematically going through a storage device. ZFS uses a volume manager that controls many other file systems, where as the FAT32 file system would only be able to manage its limited storage space on a device by having to create another FAT32 file system to manage other storage devices.

== '''Current File Systems''' ==

====NTFS====
New Technology File Systems also known as NTFS was first introduced with Windows NT and is currently being used on all modern Microsoft operating systems. The NTFS file system creates volumes which are then broken down in to clusters much like the FAT32 file system. Volumes contain several components. A volume contains a NTFS Boot Sector, Master File Table, File System Data and a Master File Table Copy. The NTFS boot sector hold the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table holds all the metadata in regards to all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally the Master File Table Copy is a copy of the Master File Table. This ensures that if there is an error with the Master File Table the file system can still be recovered. The MFT keeps track of all file attributes in a relational database. The MFT is also part of this database. Every file that is in a volume has a record created for it in the MFT. There are advantages to NTFS over other file systems; they are Recoverability Reliability, Compression and Security. NTFS implements something known as a change journal to allow the volume to be recoverable if there are an errors. The Change Journal is another recorded stored by the file system which records all the changes made to files and directories. Having the change journal also improves reliability as it can be used to correct errors in the volume. NTFS supports compression of files. Files can and are compressed in an effort to decrease the amount of space they require in the volume. With NTFS security was taken into account. Stored within the metadata in the MFT are permissions for each individual file which allows only individuals with the correct permissions to access the files. NTFS is a 64-bit file system which allows for 2^64 bytes of storage. NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB.

====ext4====

====Comparison====

== '''Future File Systems''' ==
====BTRFS====

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees
--source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. BTRFS is based on the b-tree structure where a subvolume is a named b-tree made up of the files and directories stored.

====WinFS====

== References ==

* Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. [http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739 Demystifying data deduplication]. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.

* Geer, D.; , [http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493 "Reducing the Storage Burden via Data Deduplication,"] Computer , vol.41, no.12, pp.15-17, Dec. 2008

* Bonwick, J.; [http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup ZFS Deduplication]. Jeff Bonwick's Blog. November 2, 2009.

* Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]

* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkley, CA, USA.

*2.3a - S.Tenanbaum, A. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3

*2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].

*2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].

*2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].

*2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].

*2.3f - ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].

*Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]

*Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

COMP 3000 Essay 1 2010 Question 9

2010-10-14T06:49:44Z

Azemanci: /* NTFS */