COMP 3000 Essay 1 2010 Question 9
Question
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)
Answer
Introduction
ZFS was developed by Sun Microsystems (now owned by Oracle) as a server-class file system. This distinguishes it from most file systems, which were designed for the desktop and later adapted for servers. With servers as the target, particular attention was paid to data integrity, capacity, and speed.
One of the most significant ways in which ZFS differs from traditional file systems is its level of abstraction. A traditional file system abstracts away the physical properties of the medium on which it lives (hard disk, flash drive, CD-ROM, etc.); ZFS additionally abstracts away whether the file system lives on one piece of hardware or many. A ZFS file system may span a single hard drive, an array of hard drives, or even a number of hard drives on non-co-located systems.
One mechanism that enables this abstraction is that the volume manager, normally a program separate from a traditional file system, is built directly into ZFS.
ZFS is a 128-bit file system, allowing it to address 2^128 bytes of storage.
Major Features of ZFS
ZFS stands for Zettabyte File System. ZFS combines a file system with a logical volume manager. A ZFS file system can store up to 2^58 zettabytes (ZB), where each ZB is 2^70 bytes. A traditional file system (referred to as a TFS hereafter) uses the idea of volumes to manage storage. By contrast, ZFS uses the idea of pooled storage.
Issues with volume storage (used by TFS)
# The FS cannot use disks under a different volume.
# A volume manager (another layer) is needed to manage the volume.
# Hard to grow/shrink.
# Storage is fragmented.
Advantages of pooled storage
# No partitions to manage.
# All free storage space is always available.
# Easy to grow/shrink.
Problems a ZFS attempts to tackle/avoid
# Losing important files.
# Running out of space on a partition.
# Booting with a damaged root file system.
Issues with existing File Systems
# No way to prevent silent data corruption: defects in a controller, disk, firmware, etc. can corrupt data without warning.
# Hard to manage: limits on file sizes, number of files, files per directory, etc.
In ZFS, the ideas of files and directories are replaced by objects.
ZFS is composed of seven components (achieving, and going beyond, what a TFS provides):
# SPA (Storage Pool Allocator)
# DSL (Dataset and Snapshot Layer)
# DMU (Data Management Unit)
# ZAP (ZFS Attribute Processor)
# ZPL (ZFS POSIX Layer)
# ZIL (ZFS Intent Log)
# ZVOL (ZFS Volume)
Physical Layer Abstraction
- Volume management and the file system are combined in one.
- File systems sit on top of zpools, which sit on top of vdevs, which sit on top of physical devices.
- File systems easily, and often do, span many physical devices.
- Enormous capacity.
Data Integrity
At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.
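The idea can be sketched in a few lines. This toy model (using SHA-256, one of the checksum algorithms ZFS offers alongside fletcher2/4) keeps each checksum separate from its data, as ZFS does by storing checksums in the parent block pointer; the storage itself is just a dictionary standing in for a disk:

```python
import hashlib

def write_block(storage, addr, data):
    """Store a data block together with its checksum, kept in a
    separate entry (as ZFS keeps checksums in the parent pointer)."""
    storage[addr] = data
    storage["cksum:%d" % addr] = hashlib.sha256(data).digest()

def read_block(storage, addr):
    """Re-verify the checksum on every read; a mismatch means the
    block (or the checksum itself) was silently corrupted."""
    data = storage[addr]
    if hashlib.sha256(data).digest() != storage["cksum:%d" % addr]:
        raise IOError("checksum mismatch at block %d" % addr)
    return data

disk = {}
write_block(disk, 0, b"important data")
assert read_block(disk, 0) == b"important data"

disk[0] = b"important dat\x00"   # simulate silent bit rot on disk
try:
    read_block(disk, 0)
except IOError:
    pass  # corruption is detected instead of being returned to the caller
```

The point of the separation is that a single corrupted sector cannot damage both the data and its checksum at once.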
In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
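The recovery path can be sketched with a simplified block pointer that carries several copy addresses and one expected checksum (illustrative only, not ZFS's actual on-disk format):

```python
import hashlib

def read_with_ditto(storage, block_pointer):
    """A ZFS-style block pointer can name multiple copies ("ditto
    blocks"); try each in turn until one passes its checksum."""
    expected = block_pointer["cksum"]
    for addr in block_pointer["copies"]:
        data = storage.get(addr)
        if data is not None and hashlib.sha256(data).digest() == expected:
            return data
    raise IOError("all copies failed their checksum")

meta = b"directory metadata"
disk = {10: meta, 20: b"garbled block data", 30: meta}
bp = {"copies": [20, 10, 30], "cksum": hashlib.sha256(meta).digest()}
assert read_with_ditto(disk, bp) == meta  # copy at 20 is bad; 10 is used
```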
RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.
With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
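A minimal sketch of the copy-on-write idea, again with a dictionary standing in for a disk. The real mechanism operates on whole trees of blocks rooted at an überblock, but the shape is the same: write the new version detached, then adopt the new root in one atomic step, leaving the old version intact until it is freed:

```python
def cow_update(disk, root, name, new_data):
    """Copy-on-write: never overwrite live blocks.  Write the new
    version to a fresh location, then build a new root that points at
    it; a crash before the new root is adopted leaves the old tree
    fully intact and consistent."""
    new_addr = max(disk) + 1          # a fresh, unused location
    disk[new_addr] = new_data         # detached write
    new_root = dict(root)             # copy the parent structure too
    new_root[name] = new_addr
    return new_root                   # adopting this root is the atomic switch

disk = {1: b"old contents"}
root = {"file.txt": 1}
root = cow_update(disk, root, "file.txt", b"new contents")
assert disk[root["file.txt"]] == b"new contents"
assert disk[1] == b"old contents"    # old version still present until freed
```

Because the old blocks are never overwritten in place, no journal replay or consistency check is needed after a crash.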
At the user level, ZFS supports file-system snapshots: essentially, a clone of the entire file system at a certain point in time. In the event of accidental file deletion, a user can retrieve an older version of the file from a recent snapshot.
Data Deduplication
Data deduplication is a method of inter-file storage compression, based around the idea of physically storing any one block of unique data only once, and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if the data lends itself to deduplication.
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file units (blocks), or patch sets. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the "fold factor", that is, the ratio of logical storage provided to physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.
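The hash-table approach can be illustrated at block granularity with a toy sketch (ZFS's own deduplication table is keyed by SHA-256 checksums; everything else here is simplified):

```python
import hashlib

def dedup_write(index, block):
    """Block-level deduplication: key each block by its hash, store
    unique contents once, and hand back the key for the file's block
    list.  `index` is the deduplication hash table."""
    key = hashlib.sha256(block).hexdigest()
    index.setdefault(key, block)      # store only if not already present
    return key

index = {}
blocks = [b"A" * 512, b"B" * 512, b"A" * 512, b"A" * 512]
refs = [dedup_write(index, b) for b in blocks]
assert len(index) == 2                     # 4 logical blocks, 2 physical
assert [index[r] for r in refs] == blocks  # files reconstruct exactly
```

Here four logical blocks occupy two physical slots, a fold factor of 2; smaller blocks would raise the fold factor on real data at the cost of a larger index.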
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk already in its compressed state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them later to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.
Legacy File Systems
Files live on storage devices such as hard disks and flash memory, and saving them requires an abstraction that organizes how they will be stored and later retrieved. That abstraction is the file system; two examples are FAT32 and ext2.
FAT32
A storage device's memory is made up of sectors (usually 512 bytes). Initially, each sector would hold a file's data, with larger files stored across multiple sectors. To retrieve a file, the system must record which sectors contain the data of the requested file. Since each sector is small relative to larger files, it would take significant time and memory to document every sector's file and location. Because of the inconvenience of tracking so many sectors, the FAT file system introduced clusters: defined groupings of sectors, each cluster belonging to one file. One issue with clusters is that when a file smaller than a cluster is stored, the file occupies the whole cluster and no other file can use the unused sectors within it.

The name FAT stands for File Allocation Table, the table that contains entries for the clusters on the storage device and their properties. The FAT is designed as a linked-list data structure in which each node holds a cluster's information. "For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F's indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on."#1.2a

The digits in a FAT variant's name, as in FAT32, indicate that the file allocation table is an array of 32-bit values.#1.2b Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are addressable. Larger clusters waste space when files are drastically smaller than the cluster size. When a file is accessed, the file system must locate all the clusters that make it up, which takes long if the clusters are not organized. When files are deleted, their clusters are freed for new data; as a result, some files end up with their clusters scattered across the storage device and take longer to access. FAT32 does not include a defragmentation system, but recent versions of Windows ship a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which improves the time it takes to access the file. Since reorganization is not a built-in function of FAT32, finding empty space for a new file requires a linear search through all the clusters; this is one of the drawbacks of FAT32: it is slow. The first cluster of every FAT32 volume contains information about the operating system and the root directory, and the volume always keeps two copies of the file allocation table so that, if the file system is interrupted, the secondary FAT can be used to recover the files.
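The cluster chains described in the quotation above amount to a linked-list walk. The sketch below follows the FAT32 convention that only 28 of the 32 bits are significant and that entry values of 0x0FFFFFF8 and above mark the end of a chain (a simplified model, not a full FAT reader):

```python
EOC = 0x0FFFFFF8  # FAT32 entries at or above this value end a chain

def read_file_clusters(fat, first_cluster):
    """Walk the FAT linked list: each table entry holds the number of
    the file's next cluster, until an end-of-chain marker is hit."""
    clusters = []
    cluster = first_cluster
    while cluster < EOC:
        clusters.append(cluster)
        cluster = fat[cluster] & 0x0FFFFFFF  # only 28 of 32 bits are used
    return clusters

# A toy FAT where one file starts at cluster 2 and spans 2 -> 3 -> 5
fat = {2: 3, 3: 5, 5: 0x0FFFFFFF}
assert read_file_clusters(fat, 2) == [2, 3, 5]
```

This also makes the fragmentation problem concrete: nothing forces consecutive entries in the chain to be adjacent on disk.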
Ext2
The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and mimics certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group.#1.2c Files in ext2 are represented by inodes: structures that contain the description of the file, its type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. Just as FAT32 keeps duplicate copies of the FAT in the first cluster in case of crashes, the first block in ext2 holds the superblock along with the list of group descriptors (each block group has a group descriptor that maps out where files are within the group), and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backups are used when the system has had an unclean shutdown and requires "fsck" (the file system checker), which traverses the inodes and directories to repair any inconsistencies.#1.2d
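A toy sketch of how an inode resolves to file contents through its direct block pointers (real ext2 inodes also carry indirect, double-indirect, and triple-indirect pointers for larger files, omitted here):

```python
def read_inode_data(inode, blocks):
    """Gather a file's contents by following the inode's direct block
    pointers; the recorded size trims slack space in the last block."""
    data = b"".join(blocks[p] for p in inode["block_pointers"])
    return data[:inode["size"]]

# Toy disk blocks, 7 bytes each; the file occupies two blocks
blocks = {7: b"Hello, ", 9: b"ext2!\x00\x00"}
inode = {"size": 12, "block_pointers": [7, 9]}
assert read_inode_data(inode, blocks) == b"Hello, ext2!"
```

The contrast with FAT32 is that the block list hangs off the per-file inode rather than being threaded through one global table.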
Comparison
When observing how storage devices are managed under different file systems, one notices that FAT32 has a maximum volume size of 2 TB (8 TB with 32 KB clusters, 16 TB with 64 KB clusters), ext2 supports up to 32 TB, and ZFS can address 2^58 ZB (zettabytes), where each ZB is 2^70 bytes: quite a bit larger. "ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process"#1.2e; because of this, fsck is not used in ZFS, whereas it is in the ext2 file system. Not having to take the system down and systematically check the storage device for inconsistencies saves ZFS time and resources. ZFS uses a built-in volume manager that controls many file systems, whereas a FAT32 file system can only manage its limited storage space on one device; another FAT32 file system must be created to manage additional storage devices.
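The capacity figures above are straightforward powers-of-two arithmetic and can be checked directly:

```python
KB, TB = 2 ** 10, 2 ** 40

# FAT32: at most 2**28 addressable clusters
clusters = 2 ** 28
assert clusters * 32 * KB == 8 * TB    # 32 KB clusters -> 8 TB volumes
assert clusters * 64 * KB == 16 * TB   # 64 KB clusters -> 16 TB volumes

# ZFS: 128-bit addressing equals 2**58 zettabytes (1 ZB = 2**70 bytes)
assert 2 ** 128 == 2 ** 58 * 2 ** 70
```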
Current File Systems
Future File Systems
BTRFS
BTRFS (B-tree File System) is a file system often compared to ZFS because it offers very similar functionality, even though much of the implementation differs. BTRFS is based on the B-tree structure: a subvolume is a named B-tree made up of the files and directories it stores.
WinFS
References
- Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. (2008). Demystifying Data Deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 1-5, 2008). ACM, New York, NY, 12-17.
- Geer, D. (2008). "Reducing the Storage Burden via Data Deduplication," Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.
- Bonwick, J. (2009). ZFS Deduplication. Jeff Bonwick's Blog. November 2, 2009.
- Li, A., Department of Computing, Macquarie University. Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System. [Z3]
- Zhang, Y., Rajimwale, A., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. (2010). End-to-end Data Integrity for File Systems: A ZFS Case Study. FAST '10: Proceedings of the 8th USENIX Conference on File and Storage Technologies. USENIX Association, Berkeley, CA, USA.
- 1.2a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3.
- 1.2b - Heybruck, W. F. (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [1]
- 1.2c - Chen, R. (2008). Windows Confidential: A Brief and Incomplete History of FAT32. [2]
- 1.2d - Carrier, B. (2005). File System Forensic Analysis. Addison-Wesley Professional. p. 287. [3]
- 1.2e - Brokken, F., and Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [4]
- 1.2f - ZFS FAQ - OpenSolaris. [5]
- Bonwick, J., Ahrens, M., Henson, V., Maybee, M., and Shellenbaum, M., Sun Microsystems. The Zettabyte File System. [Z1]
- Anjoy, R. G., and Chakraborty, S. K., Mälardalen University, Sweden. Feature Based Comparison of Modern File Systems. [Z2]