COMP 3000 Essay 1 2010 Question 9
== '''Legacy File Systems''' ==
In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems, how they have influenced the development of current file systems, and how they compare to a system such as ZFS.
FAT32 and ext2 are two such legacy file systems. They were designed for users who had far fewer and much smaller storage devices, and far smaller storage needs, than the average user today. The average user at the time did not keep many files on their hard drive, and because those small amounts of data were not accessed very often, the file systems did not need to do much to ensure data integrity (repairing the file system and relocating files).
====FAT32====
When files are stored on a storage device, the device's storage is made up of sectors (usually 512 bytes). Initially, the plan was for these sectors to contain the data of a file, with larger files stored across multiple sectors. To retrieve a file, each of its sectors must have been stored, along with a record of which sectors contain that file's data.
Since each sector is small compared with large files, it would take a significant amount of time and memory to record, for every sector, the file it is associated with and where it is located. To avoid tracking so many sectors, the FAT file system uses clusters, which are fixed-size groupings of sectors; each cluster belongs to one file. A drawback of clusters is that when a file is smaller than a cluster, it still occupies the whole cluster and no other file can use the unused sectors in it. The name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked list in which each node holds one cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster of the file, and this continues until the last cluster entry for the file. The first file on a new device uses sequential clusters, so the first cluster points to the second, which points to the third, and so on. The number in the name of a FAT system, as in FAT32, means that the file allocation table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. Larger clusters make the wasted-space problem worse: when files are drastically smaller than the cluster size, a lot of space in each cluster goes unused.
When a file is accessed, the file system must find all the clusters that make up the file, which takes a long time if the clusters are not stored near each other. When files are deleted, their clusters become available for new data. As a result, some files end up with their clusters scattered across the storage device, and accessing them takes longer. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmentation reorganizes the fragments of a file (its clusters) so that they are near each other, which decreases the time it takes to access the file. Since this is not a built-in function of FAT32, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table so that, if the file system is interrupted, a secondary FAT is available to recover the files.
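The linked-list behaviour described above can be illustrated with a minimal Python sketch. It models the FAT as a plain dictionary rather than parsing a real disk image; the names `cluster_chain`, `CLUSTER_MASK` and `END_OF_CHAIN_MIN` are hypothetical, and the end-of-chain threshold is an assumption made for the example.

```python
# Minimal sketch: following a file's cluster chain through an in-memory FAT.
# The table is modelled as a dict {cluster number: next cluster number};
# this is an illustration, not a parser for a real FAT32 volume.

CLUSTER_MASK = 0x0FFFFFFF      # only the low 28 bits of an entry number a cluster
END_OF_CHAIN_MIN = 0x0FFFFFF8  # assumed: values at or above this end the chain


def cluster_chain(fat, first_cluster):
    """Return the ordered list of clusters that make up one file."""
    chain = []
    visited = set()
    cluster = first_cluster
    while cluster < END_OF_CHAIN_MIN:
        if cluster in visited:
            # A loop in the chain means the table is corrupt.
            raise ValueError("cycle detected in cluster chain")
        visited.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster] & CLUSTER_MASK
    return chain


# First file on a fresh device: clusters are sequential.
sequential_fat = {2: 3, 3: 4, 4: 0x0FFFFFFF}
# A fragmented file: the clusters are scattered across the device.
fragmented_fat = {5: 9, 9: 6, 6: 0x0FFFFFFF}

sequential = cluster_chain(sequential_fat, 2)
fragmented = cluster_chain(fragmented_fat, 5)
```

The directory entry supplies only `first_cluster`; every later cluster has to be discovered one table lookup at a time, which is why a fragmented file is slower to read.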
==== Ext2 ====
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing what was unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of this FAT in case of a crash. For similar functionality, the first block in ext2 is the superblock, and it is accompanied by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group).
Backup copies of the superblock and group descriptors exist throughout the file system in case the primary copies are damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of "fsck" (the file system checker), which traverses the inodes and directories to repair any inconsistencies.
==== Comparison ====
In general, comparing how these file systems manage storage devices, the FAT32 file system has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike the ext2 system. Not having to check for inconsistencies allows ZFS to save time and resources by not systematically going through a storage device. The FAT32 file system manages a single storage device, and the ext2 file system also manages just a single storage device, whereas ZFS utilizes a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot be reasonably compared to ZFS.
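The FAT32 figures quoted above follow from simple arithmetic on the 28-bit cluster numbers. The short Python sketch below reproduces them; the helper names are hypothetical, and the 2TB case assumes that it comes from a 32-bit sector count with 512-byte sectors, which the essay itself does not state.

```python
# Sketch of the arithmetic behind the FAT32 volume limits quoted above.
# Assumption: the 2TB limit comes from a 32-bit sector count combined with
# 512-byte sectors; the other two limits come from 2^28 clusters.

TIB = 2 ** 40        # bytes per (binary) terabyte
CLUSTER_BITS = 28    # bits of each 32-bit FAT entry used to number clusters


def max_volume_by_clusters(cluster_size_bytes):
    """Largest volume addressable with 2^28 clusters of the given size."""
    return (2 ** CLUSTER_BITS) * cluster_size_bytes


def max_volume_by_sectors(sector_size_bytes=512, sector_count_bits=32):
    """Largest volume addressable with a fixed-width sector counter."""
    return (2 ** sector_count_bits) * sector_size_bytes


limit_32k = max_volume_by_clusters(32 * 1024) / TIB   # 8.0
limit_64k = max_volume_by_clusters(64 * 1024) / TIB   # 16.0
limit_sectors = max_volume_by_sectors() / TIB         # 2.0
```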
== '''Current File Systems''' ==
Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.
====NTFS====
== '''Future File Systems''' ==
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited to modern computing needs.
====Btrfs====
Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality even though a lot of the implementation is different. Development started just three years after ZFS, and its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.
Btrfs uses space efficiently through tight packing of small files, on-the-fly compression, and very fast search capability. To improve integrity, Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support for massive amounts of data.
Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks. The first is a 64-bit object id, followed by 8 bits denoting the item's type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks such as the checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few other fields.
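The three-part key can be sketched in Python as follows. This is an illustration of the 64 + 8 + 64 bit layout described above, not Btrfs code; the class name `Key`, the function `pack_key` and the item type numbers used in the example are all hypothetical.

```python
# Sketch of a 136-bit key: 64-bit object id, 8-bit item type, 64-bit
# type-dependent field. Names and type numbers are hypothetical.
from typing import NamedTuple


class Key(NamedTuple):
    object_id: int   # 64 bits
    item_type: int   # 8 bits
    offset: int      # 64 bits, meaning depends on item_type


def pack_key(key):
    """Pack the three fields into one 136-bit integer, object id first."""
    if not 0 <= key.object_id < 2 ** 64:
        raise ValueError("object_id must fit in 64 bits")
    if not 0 <= key.item_type < 2 ** 8:
        raise ValueError("item_type must fit in 8 bits")
    if not 0 <= key.offset < 2 ** 64:
        raise ValueError("offset must fit in 64 bits")
    return (key.object_id << 72) | (key.item_type << 64) | key.offset


keys = [Key(257, 12, 0), Key(256, 12, 4096), Key(256, 1, 0), Key(256, 12, 0)]

# Sorting the tuples and sorting the packed integers give the same order,
# so all items belonging to one object end up next to each other.
by_tuple = sorted(keys)
by_packed = sorted(keys, key=pack_key)
```

Because the object id occupies the most significant bits, a search for one object's items only has to visit a contiguous range of the sorted keys.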
The trees are constructed from these primitives: interior nodes contain keys that identify the node and block pointers that point to the node's children, while leaves contain multiple items and their data.
At a minimum, a Btrfs file system contains three trees. First, it contains a tree that holds the other tree roots. Second, it contains a subvolumes tree, which holds files and directories. Third, it contains an extents tree that holds information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the root of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.
The copy-on-write method is pivotal to a number of Btrfs features. Writes never overwrite existing blocks in place. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.
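The way copy-on-write makes snapshots cheap can be shown with a small Python sketch. It uses an immutable in-memory tree and is only a model of the idea, not of Btrfs internals; `Node` and `cow_update` are hypothetical names.

```python
# Sketch of copy-on-write on a tree: an update never modifies a node in
# place. It builds new nodes along the path from the changed leaf up to the
# root and shares every untouched subtree with the old version.
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    data: bytes = b""
    children: tuple = ()


def cow_update(root, path, new_data):
    """Return a new root in which the node reached by `path` holds new_data."""
    if not path:
        return Node(data=new_data, children=root.children)
    index = path[0]
    new_child = cow_update(root.children[index], path[1:], new_data)
    children = root.children[:index] + (new_child,) + root.children[index + 1:]
    return Node(data=root.data, children=children)


old_root = Node(children=(Node(b"file A v1"), Node(b"file B v1")))
new_root = cow_update(old_root, [1], b"file B v2")

# old_root still describes the previous state, so keeping a reference to it
# is effectively a snapshot; the unchanged leaf is shared, not copied.
```

Only the nodes on the path to the change are rewritten, which is why a snapshot costs little more than keeping the old root.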
Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.
====WinFS====
WinFS (Windows Future Storage), although now defunct since it was cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.
The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.
Unlike traditional file systems, including cutting-edge ZFS, WinFS moved away from a hierarchical structure for search. The file data was still going to reside in a hierarchical structure based on NTFS, but metadata was stored in a relational database. This would allow searches that went beyond normal file metadata such as time stamp, name and text content. Metadata could be added to any file type, and a search could return a collection of different file types. A search like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, the email traffic, and the contact cards from organizer software.
Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS was designed from the ground up with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts transparently to the user across potentially unreliable networks.
====Comparison====
Upon first inspection, Btrfs seems nearly identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.
WinFS provided an interesting set of features, many of which are completely different from those of ZFS, because it approached the problem differently. While both were attempting to meet the needs of the modern computer, ZFS was more focused on server settings and WinFS seemed to be focused on the requirements of the home PC user. WinFS was not focused on performance or large storage needs, and therefore it would not have been able to serve as a useful replacement for traditional file systems.
These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so it meets a wider needs profile. However, if we are discussing a viable alternative to today's file systems, ZFS is currently the best choice.
== '''Conclusion''' ==
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, that was the most attractive feature to PPPL (Princeton Plasma Physics Laboratory). PPPL collects data from plasma experiments [Z5].
The laboratory's systems administrators were having problems with their UFS (Unix File System) installation. Given the ever-growing need for additional and larger disks, and with administrators continuously switching disks, creating partitions, and copying old data to the new, larger disks, there was a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept, and to the added flexibility provided by ZFS.
Lastly, given the increasing demands not only for vast amounts of storage, but for the continuous availability of that storage, whether accessed via a smartphone or a server, the time wasted on partitions, disk swapping and similar activities, and any other inefficiency in traditional file systems, will soon become intolerable, if it hasn't already.
Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.
== '''References''' ==
* Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]. FAST'10: Proceedings of the 8th USENIX conference on File and storage technologies. USENIX Association, Berkeley, CA, USA.
* Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3
* Dr. William F. Heybruck. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [http://www-ssdp.dee.fct.unl.pt/leec/micro/20042005/teorica/Introduction_to_FAT.pdf].
* Raymond Chen. Windows Confidential - A Brief and Incomplete History of FAT32. (2008) [http://technet.microsoft.com/en-ca/magazine/2006.07.windowsconfidential.aspx].
* Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [http://proquest.safaribooksonline.com.proxy.library.carleton.ca/0321268172/ch14#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTAzMjEyNjgxNzIvMjg5].
* Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [http://e2fsprogs.sourceforge.net/ext2intro.html].
* ZFS FAQ - opensolaris [http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#whatstandfor].
* Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Micro Systems, The Zettabyte File System [Z1]
* Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]
* Microsoft-TechNet. (March 28, 2003) "How NTFS Works" [http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx].
* Microsoft-TechNet. "File Systems" [http://technet.microsoft.com/en-us/library/cc938919.aspx].
* Microsoft-TechNet. (Sept. 3, 2009) "Best practices for NTFS compression in Windows" [http://support.microsoft.com/kb/251186].
* Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., Vivier, L., and S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).
* Heger Dominique A., (Post 2007), [http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems"], DHTechnologies
* Unaccredited, [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html "Btrfs Design"], Oracle
* Mason Chris, (2007), [http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-ukuug.pdf "The Btrfs Filesystem"], Oracle
* Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), [http://research.microsoft.com/pubs/65604/tr-2006-78.pdf "Peer-to-Peer Replication in WinFS"], Microsoft Corporation
* Rector Brent, (2004), [http://msdn.microsoft.com/en-us/library/aa479870.aspx "Chapter 4. Storage"], Wise Owl Consulting
Latest revision as of 15:34, 8 November 2010
Question
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)
Answer
Introduction
ZFS was developed by Sun Microsystems in order to tackle the problem of ever increasing storage needs, particularly in a server environment. Some of the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. Also, ZFS was designed with the aim of avoiding some of the major problems associated with traditional file systems. In particular, it avoids possible data corruption, especially silent corruption, as well as the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, a low level of abstractions and a lack of simple interfaces. [Z2] The growing needs of file storage are currently best met by the ZFS file system as the requirements that this system satisfies are not fully implemented by any other one system.
ZFS
ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. To understand the implementation of these requirements of ZFS, it is important to note that ZFS is made up of the following subsystems: SPA (Storage Pool Allocator), DSL (Data Set and snapshot Layer), DMU (Data Management Unit), ZAP (ZFS Attributes Processor), ZPL (ZFS POSIX Layer), ZIL (ZFS Intent Log) and ZVOL (ZFS Volume).
The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as any non-trivial software system, that is, via the division of responsibilities across various modules (in this case, seven modules). Each one of these modules provides a specific functionality. As a consequence, the entire system becomes simpler and easier to maintain.
Storage virtualization is simplified by the removal of a volume manager, a component common in traditional file systems. The main issue with a volume manager is that it doesn't abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks) and the fact that once storage is assigned to a particular file system, it cannot be shared with other file systems even when it is unused.
In ZFS, the idea of pooled storage is used. A storage pool abstracts all available storage, and is managed by the SPA. The SPA can be thought of as a collection of APIs used for allocating and freeing blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point here is that all the details of the storage are abstracted from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage. Since a virtual address is used, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
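The malloc()/free() analogy can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not the SPA's actual interface: the class `StoragePool`, its methods, and the representation of a DVA as a `(device id, block index)` pair are all hypothetical.

```python
# Toy model of pooled storage: callers allocate and free blocks by virtual
# address and never see which device a block lives on. Devices can be added
# to the pool while it is in use.
from collections import deque


class StoragePool:
    def __init__(self):
        self._free = deque()       # free blocks as (device id, block index)
        self._allocated = set()
        self._next_device_id = 0

    def add_device(self, n_blocks):
        """Grow the pool by one device; returns the new device's id."""
        device_id = self._next_device_id
        self._next_device_id += 1
        self._free.extend((device_id, i) for i in range(n_blocks))
        return device_id

    def allocate(self):
        """Hand out one free block, like malloc() for storage."""
        if not self._free:
            raise MemoryError("storage pool exhausted")
        dva = self._free.popleft()
        self._allocated.add(dva)
        return dva

    def free(self, dva):
        """Return a block to the pool, like free() for storage."""
        self._allocated.remove(dva)
        self._free.append(dva)


pool = StoragePool()
pool.add_device(2)
first = pool.allocate()
second = pool.allocate()
pool.add_device(4)        # the pool grows without repartitioning anything
third = pool.allocate()   # served from the newly added device
```

Any file system drawing on the pool can use the blocks from the new device, which is the sharing that a fixed partition per file system prevents.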
Virtual devices abstract virtual device drivers. A virtual device can be thought of as a node with possible children. Each child can be another virtual device or a device driver. The SPA also handles the traditional volume manager tasks such as mirroring. It accomplishes such tasks via the use of virtual devices. Each one implements a specific task. In this case, if SPA needed to handle mirroring, a virtual device would be written to handle mirroring. It is clear here that adding new functionality is straightforward, given the clear separation of modules and the use of interfaces.
The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information, which allows for a very large storage limit.
ZFS also uses the idea of a dnode. A dnode is a data structure that stores information about the blocks belonging to each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is used to describe the file system. In essence then, in ZFS, a file system is a collection of object sets, each of which is a collection of objects, each of which in turn is a collection of blocks. Such levels of abstraction increase ZFS's flexibility and simplify its management. Lastly, with respect to flexibility, the ZFS POSIX layer provides a POSIX-compliant layer to manage DMU objects. This allows any software developed for a POSIX-compliant file system to work seamlessly with ZFS.
ZPL also plays an important role in achieving data consistency, and maintaining its integrity. This brings us to the topic of how ZFS achieves self-healing. Its main strategies are checksumming, copy-on-write, and the use of transactions.
Starting with transactions, ZPL combines data write changes to objects and uses the DMU object transaction interface to perform the updates. [Z1.P9]. Consistency is therefore assured since updates are done atomically.
To self-heal a corrupted block, ZFS uses checksumming. It is easiest to imagine blocks as part of a tree where the actual data resides at the leaves. The root node, in ZFS terminology, is called the uberblock. Each block has a checksum which is maintained in its parent's indirect block. Keeping the checksum and the data in separate places reduces the probability of both being corrupted at the same time. If a write fails for whatever reason, the failure can be detected when the block is next read, since the checksum held by the parent, in a chain that leads up to the uberblock, will not match the corrupted block. ZFS is then able to retrieve a backup copy from another location and correct (heal) the affected block. Lastly, the DMU uses copy-on-write for all blocks in order to implement self-healing. Whenever a block needs to be modified, a new block is allocated and the modified contents are written to it, leaving the old block intact. Any pointers and indirect blocks that refer to it are then updated in the same way, traversing all the way up to the uberblock. [Z1]
The DMU thus ensures data integrity all the time. This is considered self-healing simply because it prevents major problems, such as silent data corruption, which are hard to detect.
Data Integrity
At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.
In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
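The checksum-and-ditto-block read path described above can be sketched as follows. This is only an illustration under stated assumptions: the `BlockPointer` class, the `write_block` and `read_block` helpers, and the dictionary standing in for a disk are all hypothetical, and SHA-256 is used purely for convenience rather than as a statement about ZFS's on-disk format.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is an illustrative choice, not a claim about ZFS defaults.
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: one expected checksum, several copy addresses."""

    def __init__(self, addresses, expected):
        self.addresses = list(addresses)
        self.expected = expected


def write_block(disk: dict, addresses, data: bytes) -> BlockPointer:
    # Store the same data at every address (the "ditto" copies) and keep
    # the checksum in the pointer, separate from the data itself.
    for addr in addresses:
        disk[addr] = data
    return BlockPointer(addresses, checksum(data))


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    # Try each copy in turn; repair any bad copies once a good one is found.
    bad = []
    for addr in ptr.addresses:
        data = disk.get(addr, b"")
        if checksum(data) == ptr.expected:
            for bad_addr in bad:
                disk[bad_addr] = data  # heal the corrupted copy
            return data
        bad.append(addr)
    raise IOError("all copies failed checksum verification")


disk = {}
ptr = write_block(disk, [10, 20], b"metadata")
disk[10] = b"metaXata"          # simulate silent corruption of one copy
healthy = read_block(disk, ptr)  # falls back to the copy at address 20
```

After the read, the copy at address 10 has been rewritten from the healthy copy, which is the sense in which the block is healed.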
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
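The copy-on-write update described above can be sketched with a small tree. This is a simplified model, not ZFS code: `Node`, `cow_update` and `Pool` are hypothetical names, and the single assignment to `Pool.root` stands in for the one atomic write that connects the new structures.

```python
class Node:
    """Tree node that is never modified after creation."""

    def __init__(self, data=None, children=()):
        self.data = data
        self.children = tuple(children)


def cow_update(node, path, new_data):
    """Build a new root in which the leaf at `path` holds `new_data`.

    Only the nodes along the path are re-created; every other subtree
    is shared with the old tree, which stays fully intact.
    """
    if not path:
        return Node(data=new_data)
    index = path[0]
    children = list(node.children)
    children[index] = cow_update(children[index], path[1:], new_data)
    return Node(children=children)


class Pool:
    def __init__(self, root):
        self.root = root  # plays the role of the uberblock pointer

    def commit(self, path, new_data):
        new_root = cow_update(self.root, path, new_data)  # written detached
        self.root = new_root  # the single switch to the new structures


leaf_a, leaf_b = Node(data=b"a"), Node(data=b"b")
old_root = Node(children=[leaf_a, leaf_b])
pool = Pool(old_root)
pool.commit([1], b"B")
```

Because `old_root` still describes a complete, consistent tree after the commit, keeping a reference to it is enough to model a snapshot.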
At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.
Data Deduplication
Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, to sub-file units (blocks), or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
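A block-level scheme of this kind can be sketched with a hash table keyed by block digest. The `DedupStore` class below is a hypothetical in-memory model, the 4-byte block size is chosen only to keep the example small, and the fold factor is computed as logical bytes divided by physical bytes, which is an assumption about how the term is measured.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data: bytes):
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read(self, name) -> bytes:
        # Reconstruction: follow the logical links back to the stored blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

    def fold_factor(self) -> float:
        logical = sum(len(self.blocks[d]) for ds in self.files.values() for d in ds)
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical if physical else 1.0


store = DedupStore(block_size=4)
store.write("a", b"AAAABBBB")
store.write("b", b"AAAACCCC")  # the block b"AAAA" is shared with file "a"
```

Shrinking `block_size` tends to find more shared blocks but grows the `blocks` and `files` tables, which is the trade-off described above.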
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk in the traditional way, without any analysis. A background process analyzes these files at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.
Legacy File Systems
In order to determine the needs and basic requirements of a file system, it is necessary to consider legacy file systems and how they have influenced the development of current file systems and how they compare to a system such as ZFS.
One such legacy file system is FAT32, and another is ext2. These file systems were designed for users who had far fewer and much smaller storage devices, and far smaller storage needs, than the average user today. The average user at the time would not have had many files stored on their hard drive, and because the small amounts of data were not accessed that often, the file systems did not need to worry much about procedures for ensuring data integrity (repairing the file system and relocating files).
FAT32
Storage devices are divided into sectors (usually 512 bytes each). Initially the plan was for these sectors to hold the data of a file, with larger files spanning multiple sectors. To retrieve a file, the file system must have a record of which sectors contain that file's data. Since a sector is small compared to many files, recording the owning file and location for every individual sector would take significant time and memory. To avoid this overhead, the FAT file system uses clusters, which are fixed-size groupings of sectors, and each cluster belongs to at most one file. One drawback of clusters is that a file smaller than a cluster still occupies the whole cluster, and no other file can use the unused sectors in it. The name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device along with its properties. The FAT is organized as a linked list in which each node holds one cluster's information. The device directory contains the name and size of each file and the number of the first cluster allocated to it. The table entry for that first cluster contains the number of the second cluster of the file, and so on until the entry for the file's last cluster. The first file on a new device will use sequential clusters, so the first cluster will point to the second, which will point to the third, and so on.
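The linked-list lookup described above can be sketched as follows. The table contents, the file names and the `END_OF_CHAIN` marker are made up for illustration and do not reflect the real FAT32 on-disk encoding.

```python
END_OF_CHAIN = -1  # illustrative marker, not FAT32's real end-of-chain value


def cluster_chain(fat, first_cluster):
    """Follow FAT entries from a file's first cluster to its last."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("loop in cluster chain")  # corrupted table
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # each entry names the next cluster of the file
    return chain


# Each key is a cluster number; each value is the next cluster in that file.
fat = {2: 3, 3: 4, 4: END_OF_CHAIN, 5: 9, 9: 6, 6: END_OF_CHAIN}

# The directory records only the first cluster of each file.
directory = {"first.txt": 2, "fragmented.txt": 5}

sequential = cluster_chain(fat, directory["first.txt"])
scattered = cluster_chain(fat, directory["fragmented.txt"])
```

The first file occupies consecutive clusters, while the second jumps around the device, which is the fragmented case that slows down access.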
The number in the name of a FAT variant, such as FAT32, indicates the width of the entries in the file allocation table: in FAT32 the table is an array of 32-bit values. Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available. One problem with larger clusters is that when files are drastically smaller than the cluster size, a lot of space in each cluster is wasted. When a file is accessed, the file system must find all of the clusters that make up the file, and this takes longer if the clusters are not stored close together. When files are deleted, their clusters are freed and become available for new data. As a result, the clusters of some files end up scattered across the storage device, and those files take longer to access. FAT32 does not include a defragmentation facility, but all recent versions of Windows come with a defragmentation tool. Defragmentation reorganizes the fragments (clusters) of a file so that they are near each other, which decreases the time it takes to access the file. In addition, looking for empty space to store a file requires a linear search through all the clusters. Therefore, the lack of efficiency, the lack of sufficient storage space and the lack of data integrity preservation are all major drawbacks of FAT32. However, one useful feature of FAT32 is that the first cluster of every FAT32 file system contains information about the operating system and the root directory, and the file system always keeps 2 copies of the file allocation table, so that if the file system is interrupted, the secondary FAT can be used to recover the files.
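The effect of the 28-bit cluster number and of the cluster size can be worked through with a little arithmetic. The function names below are hypothetical, sizes are in binary units, and the volume figure is only the upper bound implied by the cluster count, ignoring any other limits a real implementation imposes.

```python
CLUSTER_NUMBER_BITS = 28  # bits of each 32-bit FAT entry used for cluster numbers


def fat32_max_volume_bytes(cluster_size: int) -> int:
    """Upper bound on volume size from the cluster count alone."""
    return (2 ** CLUSTER_NUMBER_BITS) * cluster_size


def slack_bytes(file_size: int, cluster_size: int) -> int:
    """Space wasted because a file always occupies whole clusters."""
    if file_size == 0:
        return 0
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size - file_size


KIB = 1024
TIB = 1024 ** 4

limit_32k = fat32_max_volume_bytes(32 * KIB) // TIB  # 2^28 * 2^15 bytes
limit_64k = fat32_max_volume_bytes(64 * KIB) // TIB  # 2^28 * 2^16 bytes
wasted = slack_bytes(1, 32 * KIB)  # a 1-byte file in a 32 KiB cluster
```

Larger clusters raise the size bound but waste more space per small file, since a 1-byte file still consumes an entire cluster.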
Ext2
The ext2 file system (second extended file system) was modelled on UFS (the Unix File System) and attempts to mimic certain functionality of UFS while removing what is unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: the file type, access rights, owners, timestamps, size, and the pointers to the data blocks that hold the file's data. In FAT32, the file allocation table defines how file fragments are organized, and it is important to have duplicate copies of the FAT in case of a crash. Ext2 provides similar protection: the first block is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the file system in case the primary copies are damaged. These backup copies are used when the system fails or shuts down suddenly and therefore requires the use of “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies.
Comparison
In general, when comparing how these file systems manage storage devices, one notices that FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters) and ext2 has a maximum of 32TB, while ZFS supports 2^58 ZB, an amount which is incomparably larger. Also, ZFS has the ability to find and replace bad data while the system is running, which means that fsck is not used in ZFS, very much unlike ext2. Not having to check for inconsistencies saves ZFS time and resources, since it does not need to systematically scan a storage device. FAT32 and ext2 each manage a single storage device, whereas ZFS uses a volume manager that can control multiple storage devices, virtual or physical. It is easy to see that although these legacy file systems fulfill the basic role of storing and retrieving data, they cannot reasonably be compared to ZFS.
Current File Systems
Current file systems, on the other hand, are much more comparable to ZFS and actually do fulfill some of the same requirements. However, despite their current popularity and usage, they do not provide all of the functionality possible using ZFS.
NTFS
One system that is currently in widespread use is NTFS, the New Technology File System. This system was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. It implements storage by creating volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains several components: an NTFS Boot Sector, a Master File Table, File System Data and a Master File Table Copy.
The NTFS boot sector holds the information that communicates to the BIOS the layout of the volume and the file system structure. The Master File Table (MFT) holds all the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the MFT which ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is also part of this database. Every file in a volume has a record created for it in the MFT. NTFS is a journaling file system, which means it uses a journal to ensure data integrity: the file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system. This assists performance, as the entire volume does not have to be scanned to find changes. Therefore, the main advantages of NTFS are that it provides data recovery, data integrity and protection of data in case of interruption during writing.
Another advantage is that it allows for compression of files to save disk space. However, this has a negative effect on performance, because in order to move compressed files they must first be decompressed, then transferred and recompressed. Another disadvantage is that NTFS has certain volume and size constraints. In particular, NTFS is a 64-bit file system, which allows for a maximum of 2^64 bytes of storage. It is also capped at a maximum file size of 16TB and a maximum volume size of 256TB. This is significantly less than the storage capability provided by ZFS.
ext4
The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents, where an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume. They allow for better performance when handling large files compared with ext3, which had a very large overhead when dealing with larger files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to the disk. To ensure data integrity, ext4 uses checksumming. In ext4 a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns when moving data, as there is no compressing and decompressing. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block number increases the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4] The primary goal of ext4 was to increase the amount of storage possible.
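Two of the points above lend themselves to a short sketch: an extent as a single descriptor for a contiguous run of blocks, and the volume limits that follow from the width of the block number. The `Extent` class and `addressable_bytes` function are hypothetical, and the 4 KiB block size is an assumption made here because it reproduces the 16TB and 1EB figures quoted in the text.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size in bytes


@dataclass(frozen=True)
class Extent:
    """One descriptor covering `length` contiguous physical blocks."""
    logical_start: int   # first block number within the file
    physical_start: int  # first block number on the volume
    length: int          # number of contiguous blocks

    def lookup(self, logical_block: int):
        # Map a file-relative block to its physical block, if covered.
        offset = logical_block - self.logical_start
        if 0 <= offset < self.length:
            return self.physical_start + offset
        return None


def addressable_bytes(block_number_bits: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes reachable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size


TIB = 1024 ** 4
EIB = 1024 ** 6

extent = Extent(logical_start=0, physical_start=1000, length=8)
ext3_limit_tib = addressable_bytes(32) // TIB  # 2^32 blocks * 2^12 bytes
ext4_limit_eib = addressable_bytes(48) // EIB  # 2^48 blocks * 2^12 bytes
```

A single `Extent` here replaces eight separate per-block mappings, which is why extents reduce overhead for large files.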
Comparison
The most noticeable difference when comparing ZFS to other current file systems is the size. NTFS allows for a maximum volume size of 256TB and ext4 allows for 1EB. ZFS allows a single file of up to 16EB, which is 16 times the maximum size of an entire ext4 volume, and its storage pools can grow far larger still. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk to check for errors.
Future File Systems
The newest file systems are quickly closing in on ZFS, and the forerunner of the pack is currently Btrfs. Designed with a similar motivation to ZFS, Btrfs provides a similarly rich feature set that is ideally suited to modern computing needs.
Btrfs
Btrfs, the B-tree File System, started by Oracle in 2007, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. Although its development started three years after that of ZFS, its main goals of efficiency, integrity and maintainability are clearly visible in its feature set.
Btrfs efficiently uses space with tight packing of small files, on the fly compression, and very fast search capability. To improve integrity Btrfs has checksums for writes and snapshots. Maintainability is supported with efficient incremental backups, live defragmentation, dynamic expansion of the file system to incorporate new devices, and support of massive amounts of data.
Btrfs is based on the B-tree structure. These trees are composed of three data types: keys, items and block headers. The entries stored in the trees are referred to as items, and all are sorted on a 136-bit key. The key is divided into three chunks: the first is a 64-bit object id, followed by 8 bits denoting the item’s type, and finally the last 64 bits have distinct uses depending on the type. The unique key helps with quick searches using hash tables and is necessary for the sorting algorithms that help keep the tree balanced. Each item contains a key and information on the size and location of the item's data. Block headers contain various information about blocks, such as checksum, level in the tree, generation number, owner, block number, flags, tree id and chunk id, as well as a few other fields.
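The three-part key layout described above can be sketched as a small value type. The `Key` class, its field names and the sample type codes are hypothetical and are not taken from the Btrfs sources; the sketch only shows how a 64-bit object id, an 8-bit type and a 64-bit type-dependent field combine into one 136-bit sortable value.

```python
from typing import NamedTuple


class Key(NamedTuple):
    objectid: int   # 64 bits
    item_type: int  # 8 bits
    offset: int     # 64 bits, meaning depends on the item type

    def pack(self) -> int:
        """Combine the three fields into a single 136-bit integer."""
        if not (0 <= self.objectid < 2 ** 64
                and 0 <= self.item_type < 2 ** 8
                and 0 <= self.offset < 2 ** 64):
            raise ValueError("key field out of range")
        return (self.objectid << 72) | (self.item_type << 64) | self.offset

    @classmethod
    def unpack(cls, value: int) -> "Key":
        return cls(value >> 72, (value >> 64) & 0xFF, value & (2 ** 64 - 1))


# Sample keys with made-up type codes, deliberately out of order.
keys = [Key(257, 2, 0), Key(256, 9, 4096), Key(256, 1, 0), Key(256, 9, 0)]

by_fields = sorted(keys)                # compares objectid, then type, then offset
by_packed = sorted(keys, key=Key.pack)  # compares the 136-bit integers
```

Because the object id occupies the most significant bits, both orderings agree, so all items belonging to one object sit next to each other in the tree.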
The trees are constructed of these primitive items with interior nodes containing keys to identify the node and block pointers that point to the child of the node. Leaves contain multiple items and their data.
At a minimum, the Btrfs file system contains three trees. First, it contains a tree that holds the roots of the other trees. Second, it contains a subvolumes tree, which holds files and directories. Third, it contains an extents volume tree that contains information about all the allocated extents. Outside of the trees are the extents themselves and a superblock. The superblock is a data structure that points to the tree of roots. Btrfs can have additional trees that are added to support other features such as logs, chunks and data relocation.
The copy-on-write method is pivotal to a number of Btrfs features. Writes never occur in place on the existing blocks. A transaction log is created and writes are cached. Then, the file system allocates sufficient blocks for the new data and the new data is written there. All subvolumes are updated to point to the new blocks. The old blocks are then removed and freed at the discretion of the file system. Copy-on-write, combined with the internal generation number, allows the system to create snapshots of the data. After each copy the checksum is also recalculated on a per-block basis and a duplicate is made to another chunk. These actions combine to provide exceptional data integrity.
Therefore, Btrfs provides a very simple underlying implementation and provides a number of features that help ensure this file system will remain useful in the future.
WinFS
WinFS (Windows Future Storage), although now defunct after being cancelled in 2006, was the brainchild of Microsoft and deserves comparison with ZFS. While the other best-in-class file systems (most notably Btrfs) all scramble to meet modern computing needs with a feature set almost identical to that of ZFS, WinFS attempted a complete rethink of the file system.
The two most notable features of WinFS were its database-based design and its peer-to-peer replication services.
Unlike traditional file systems, including the cutting-edge ZFS, WinFS was moving away from relying on the hierarchical structure of the file system for search. The file data was still going to reside in a hierarchical structure based on NTFS, but meta-data would be stored in a relational database. This would allow searches that transcended normal file meta-data such as time stamp, name and text content. Meta-data could be added to any file type, and the resulting searches would return a collection of file types. Searches like "all the people I have pictures of while I was on my trip in Thailand and whose email address I have" would be understood. The query would return the pictures, the email traffic, and the contact cards from organizer software.
Microsoft also recognized the increasing importance of being able to replicate and synchronize data between devices. WinFS from the ground up was designed with the ability to replicate data across thousands of computers. The synchronization algorithms were also intended to resolve conflicts with transparency to the user across potentially unreliable networks.
Comparison
Upon first inspection, Btrfs seems nearly identical to ZFS. However, Btrfs does lack some features of ZFS. First of all, Btrfs doesn't have the self-healing capability or data deduplication of ZFS. Second, ZFS also supports more configurations of software RAID than Btrfs. However, since ZFS has had three more years of development, it is still very possible for Btrfs to catch up.
WinFS provided an interesting set of features, many of which are completely different than ZFS, because it approached the problem differently. While both were attempting to meet the needs of the modern computer, ZFS was more focused on server settings and WinFS seemed to be focused on the requirements of the home PC user. WinFS was not focused on performance or large memory storage needs and therefore it would not have been able to serve as a useful replacement for traditional file systems.
These two file systems demonstrate that, while at this point no one system is capable of providing the full functionality and capabilities of ZFS, systems such as Btrfs are, at least, coming close. Perhaps someone will also take up WinFS and change it so that it meets a wider range of needs. However, if we are discussing a viable alternative to the file systems in use today, ZFS is currently the best choice.
Conclusion
ZFS is a vast improvement over traditional file systems. Its modularization, administrative simplicity, self-healing, use of storage pools, and POSIX compliance make it a viable file system replacement. Simplicity and administrative ease are perhaps among its more important features. In fact, this was the feature most attractive to PPPL (Princeton Plasma Physics Laboratory), which collects data from plasma experiments [Z5].
The laboratory's systems administrators were having problems with UFS (the Unix File System). The ever-growing need for additional and larger disks meant that administrators were continuously switching disks, creating partitions and copying old data to the new, larger disks, which caused a significant amount of user downtime. Therefore, the administrators were attracted to the storage pool concept and to the added flexibility provided by ZFS.
Lastly, demand is increasing not only for vast amounts of storage but for the continuous availability of that storage, whether accessed via a smart-phone or a server. Time wasted on partitioning, disk swapping and similar activities, and any other inefficiency in traditional file systems, will soon become intolerable, if it has not already.
Therefore, ZFS should be seriously considered as a smart file system solution for today and for the foreseeable future.
References
- Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. Demystifying data deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.
- Geer, D., "Reducing the Storage Burden via Data Deduplication," Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008
- Bonwick, J.; ZFS Deduplication. Jeff Bonwick's Blog. November 2, 2009.
- Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]
- C. Pugh, P. Henderson, K. Silber, T. Caroll, K. Ying, Information Technology Division, Princeton Plasma Physics Laboratory (PPPL), Utilizing ZFS For the Storage of Acquired Data [Z5]
- Zhang, Yupu and Rajimwale, Abhishek and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H.; End-to-end Data Integrity for File Systems: A ZFS Case Study. FAST'10: Proceedings of the 8th USENIX Conference on File and Storage Technologies. USENIX Association, Berkeley, CA, USA.
- Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3
- Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [1].
- Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [2].
- Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [3].
- Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [4].
- ZFS FAQ - opensolaris [5].
- Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]
- Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]
- Microsoft- TechNet.(March 28, 2003) "How NTFS Works" [6].
- Microsoft-TechNet. "File Systems" [7].
- Microsoft-TechNet. ( Sept. 3, 2009) "Best practices for NTFS compression in Windows" [8].
- Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., Vivier, L., and S.A.S., B. 2007. The new Ext4 filesystem: current status and future plans. In Ottawa Linux Symposium (OLS’07).
- Heger Dominique A., (Post 2007), "Workload Dependent Performance Evaluation of the Btrfs and ZFS Filesystems", DHTechnologies
- Unaccredited, "Btrfs Design", Oracle
- Mason Chris, (2007), "The Btrfs Filesystem", Oracle
- Novik Lev, Irena Hudis, Douglas B. Terry, Sanjay Anand, Vivek J. Jhaveri, Ashish Shah, Yunxin Wu (2006), "Peer-to-Peer Replication in WinFS", Microsoft Corporation
- Rector Brent, (2004),"Chapter 4. Storage", Wise Owl Consulting