COMP 3000 Essay 1 2010 Question 9

Question

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

Answer

Introduction

ZFS was developed by Sun Microsystems (now owned by Oracle) as a server-class file system that combines the functionality of a file system with that of a volume manager. This differs from most file systems, which were developed as desktop file systems that could also be used by servers; with the server as the target, particular attention was paid to data integrity, capacity, and speed. Other motivations behind the development of ZFS were modularity and simplicity, immense scalability (ZFS is a 128-bit file system, allowing it to address 2^128 bytes of storage), and ease of administration. The designers were also keen to avoid some of the pitfalls of traditional file systems: data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, and a less than desirable level of abstraction in their interfaces. [Z2]

One of the most significant ways in which ZFS differs from traditional file systems is its level of abstraction. A traditional file system abstracts away the physical properties of the medium it lives on (hard disk, flash drive, CD-ROM, etc.); ZFS additionally abstracts away whether the file system lives on one piece of hardware or many, for example a single hard drive, an array of hard drives, or a number of hard drives on systems that are not co-located. One of the mechanisms that allows this abstraction is that the volume manager, normally a program separate from the file system, is moved into ZFS itself.

ZFS

ZFS differs from traditional file systems in various ways; some of the primary differences are modularity, virtualization of storage, and the ability to self-repair. A brief look at ZFS's various components will help illustrate those differences.

The following subsystems make up ZFS [Z3, p. 2]:

  1. SPA (Storage Pool Allocator)
  2. DSL (Dataset and Snapshot Layer)
  3. DMU (Data Management Unit)
  4. ZAP (ZFS Attribute Processor)
  5. ZPL (ZFS POSIX Layer)
  6. ZIL (ZFS Intent Log)
  7. ZVOL (ZFS Volume)

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system, i.e. via the division of responsibilities across various modules (in this case, seven). Each of these modules provides a specific piece of functionality; as a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of the volume manager as a separate component, which is common in traditional file systems. The main issue with a traditional volume manager is that it does not abstract the underlying physical storage enough: physical blocks are abstracted as logical ones, yet a consecutive sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system it cannot be shared with other file systems, even when it sits unused. ZFS instead uses the idea of pooled storage. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, addressed by their DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all the details of the storage are abstracted away from the caller. ZFS uses DVAs to simplify the operations of adding and removing storage: since a virtual address is used, storage can be added to and removed from the pool dynamically. And since the SPA uses 128-bit block addresses, it will be a very long time before the address space becomes a limitation; even then, the SPA module could be replaced with the remaining modules of ZFS left intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
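
As a rough illustration of that malloc()/free() analogy, the sketch below (in C, with invented names such as dva_t, pool_alloc() and pool_free(); it is not the real SPA interface) shows what a block-allocation API over a pool might look like:

  /* Hypothetical sketch of a malloc()/free()-style block interface over a
   * storage pool.  dva_t, pool_alloc() and pool_free() are invented names,
   * not the real SPA API. */

  #include <stdint.h>
  #include <stddef.h>

  struct pool;                    /* opaque: the caller never sees disks */

  typedef struct dva {            /* data virtual address */
      uint64_t vdev_id;           /* which top-level virtual device */
      uint64_t offset;            /* location within that vdev */
  } dva_t;

  /* Ask the pool for 'size' bytes of storage; the caller never learns
   * which physical device (or devices) the blocks land on. */
  dva_t pool_alloc(struct pool *p, size_t size);

  /* Return previously allocated storage to the pool. */
  void pool_free(struct pool *p, dva_t addr, size_t size);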

Virtual devices (vdevs) provide the abstraction over device drivers. A vdev can be thought of as a node with possible children, where each child is either another vdev or a leaf that wraps a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, via vdevs: each vdev implements a specific task, so if the SPA needs to handle mirroring, a mirroring vdev is written for it. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.
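
A minimal sketch of the vdev idea, using invented names and layout rather than the actual ZFS source: an interior vdev such as a mirror simply delegates to its children.

  /* Illustrative-only sketch of a vdev tree: interior nodes implement
   * policies (e.g. mirroring) and leaves wrap real disks. */

  #include <stdint.h>
  #include <stddef.h>

  struct vdev;

  typedef struct vdev_ops {
      /* each vdev type supplies its own read routine */
      int (*read)(struct vdev *vd, uint64_t offset, void *buf, size_t len);
  } vdev_ops_t;

  typedef struct vdev {
      const vdev_ops_t *ops;      /* mirror, plain disk, ... */
      struct vdev **children;     /* child vdevs or leaf disks */
      int nchildren;
  } vdev_t;

  /* A mirror vdev tries each child until one returns good data. */
  static int mirror_read(vdev_t *vd, uint64_t off, void *buf, size_t len)
  {
      for (int i = 0; i < vd->nchildren; i++) {
          vdev_t *child = vd->children[i];
          if (child->ops->read(child, off, buf, len) == 0)
              return 0;           /* healthy copy found */
      }
      return -1;                  /* every copy failed */
  }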

The DMU (Data Management Unit) accepts blocks as input from the SPA, and produces objects.

ZFS maintains data integrity and accomplishes self-healing through copy-on-write updates, end-to-end checksumming, and a transactional update model; these mechanisms are discussed in the Data Integrity section below.

Problems ZFS attempts to tackle or avoid:

  • Losing important files
  • Running out of space on a partition
  • Booting with a damaged root file system

Issues with existing file systems:

  • No way to prevent silent data corruption (e.g. defects in a controller, disk, or firmware can corrupt data silently)
  • Hard to manage
  • Limits on file sizes, number of files, files per directory, etc.

In ZFS, the ideas of files and directories are replaced by objects.

Physical Layer Abstraction

  • Volume management and the file system are combined in one system
  • File systems sit on top of zpools, which sit on top of vdevs, which sit on top of physical devices
  • File systems easily and often span many physical devices
  • Enormous capacity (128-bit block addressing)


Data Integrity

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.

In the event that a bad checksum is found, replication of data, in the form of "Ditto Blocks" provide an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
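
The following sketch illustrates how such a read path could work, with checksum verification and fallback to a duplicate copy. The helper functions and the three-copy block pointer layout are simplifications for illustration, not ZFS internals.

  /* Sketch of a self-validating read: verify the block against the
   * checksum stored in its block pointer, and fall back to a duplicate
   * ("ditto") copy on a mismatch. */

  #include <stdint.h>
  #include <stddef.h>

  #define MAX_COPIES 3

  typedef struct blkptr {
      uint64_t copy_addr[MAX_COPIES];  /* up to three copies of the block */
      uint64_t checksum;               /* stored in the pointer, not the block */
  } blkptr_t;

  uint64_t checksum_of(const void *buf, size_t len);          /* assumed helper */
  int      read_copy(uint64_t addr, void *buf, size_t len);   /* assumed helper */

  int read_block(const blkptr_t *bp, void *buf, size_t len)
  {
      for (int i = 0; i < MAX_COPIES; i++) {
          if (read_copy(bp->copy_addr[i], buf, len) != 0)
              continue;                        /* this copy unreadable */
          if (checksum_of(buf, len) == bp->checksum)
              return 0;                        /* checksum matches: good data */
          /* mismatch: silent corruption detected, try the next copy */
      }
      return -1;                               /* no healthy copy found */
  }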

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
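
A sketch of the copy-on-write commit idea follows; the "uberblock" name follows ZFS terminology, but the code itself is an invented illustration, not the real transaction engine.

  /* Sketch of a copy-on-write commit: modified blocks are written to free
   * space first, then a single atomic root update makes them live.
   * write_new_tree() and write_uberblock() are assumed helpers. */

  #include <stdint.h>

  typedef struct uberblock {
      uint64_t root_block;    /* address of the current metadata tree */
      uint64_t txg;           /* transaction group number */
  } uberblock_t;

  uint64_t write_new_tree(uint64_t old_root);       /* copies + modifies, returns new root */
  void     write_uberblock(const uberblock_t *ub);  /* one atomic, checksummed write */

  void commit_transaction(uberblock_t *active)
  {
      /* 1. Write all modified data and metadata to unused space.
       *    The old tree remains intact and reachable the whole time. */
      uint64_t new_root = write_new_tree(active->root_block);

      /* 2. Switch to the new tree with one atomic write of the root.
       *    A crash before this point leaves the old consistent tree;
       *    a crash after it leaves the new consistent tree. */
      active->root_block = new_root;
      active->txg++;
      write_uberblock(active);
  }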

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

Data Deduplication

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-file blocks, or patch sets. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the "fold factor", that is, the ratio of logical storage provided to physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at its I/O capacity.
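
A toy sketch of in-band, block-level deduplication with a hash table is shown below. The helper names (hash_block, ddt_lookup, ddt_insert, store_block) are assumptions for illustration and do not correspond to ZFS's actual deduplication-table code.

  /* Toy in-band dedup write path: hash the incoming block; if the hash is
   * already present, add a reference to the existing copy, otherwise store
   * the block once. */

  #include <stdint.h>
  #include <stddef.h>

  typedef struct dedup_entry {
      uint8_t  digest[32];     /* e.g. SHA-256 of the block contents */
      uint64_t block_addr;     /* where the single physical copy lives */
      uint64_t refcount;       /* how many logical blocks point at it */
  } dedup_entry_t;

  /* assumed helpers, declared but not implemented here */
  void           hash_block(const void *buf, size_t len, uint8_t digest[32]);
  dedup_entry_t *ddt_lookup(const uint8_t digest[32]);
  dedup_entry_t *ddt_insert(const uint8_t digest[32], uint64_t addr);
  uint64_t       store_block(const void *buf, size_t len);

  /* In-band path: the block is deduplicated before it is ever written. */
  uint64_t dedup_write(const void *buf, size_t len)
  {
      uint8_t digest[32];
      hash_block(buf, len, digest);

      dedup_entry_t *e = ddt_lookup(digest);
      if (e != NULL) {
          e->refcount++;                      /* duplicate: no new physical write */
          return e->block_addr;
      }

      uint64_t addr = store_block(buf, len); /* unique: write it once */
      ddt_insert(digest, addr);
      return addr;
  }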

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

Legacy File Systems

Files live on storage devices such as hard disks and flash memory, and when files are saved to these devices there must be an abstraction that organizes how they will be stored and later retrieved. That abstraction is the file system; FAT32 is one such file system, and ext2 is another.

FAT32

When files are stored on a storage device, the device's space is divided into sectors (usually 512 bytes). Initially the plan was for each sector to contain the data of a file, with larger files spanning multiple sectors. To retrieve a file, the system must record which sectors contain that file's data. Since each sector is small relative to many of the files in use, tracking every individual sector would take a significant amount of time and memory. To avoid this, the FAT file system groups sectors into clusters; each cluster belongs to at most one file. A drawback of clusters is that when a stored file is smaller than a cluster, the unused sectors in that cluster cannot be used by any other file.

The name FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device and its properties. The FAT is organized as a linked-list data structure in which each entry holds a cluster's information: "For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F's indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on." [2.3a] The digits in each FAT variant's name, as in FAT32, give the width of the table entries: the file allocation table is an array of 32-bit values. [2.3b] Of the 32 bits, 28 are used to number clusters, so 2^28 clusters can be addressed.

Larger clusters waste space when files are drastically smaller than the cluster size. When a file is accessed, the file system must find all of the clusters that make up the file, which takes longer if those clusters are not organized. When files are deleted, their clusters are freed for new data, so over time a file's clusters may end up scattered across the storage device, making access slower. FAT32 itself does not include a defragmentation mechanism, but recent Windows releases ship a defragmentation tool; defragmenting reorganizes the fragments (clusters) of each file so they reside near one another, which improves access time. Since reorganization is not built into FAT32, finding free space when storing a file requires a linear search through the clusters; this is one of the reasons FAT32 is slow.

The beginning of every FAT32 volume contains information about the operating system and the root directory, and always holds two copies of the file allocation table, so that if the primary FAT is damaged the secondary FAT can be used to recover the files.
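
The cluster-chain idea can be sketched as follows; read_cluster() and the surrounding code are schematic, but the 28-bit masking and the end-of-chain marker follow the FAT32 format.

  /* Simplified sketch of following a FAT32 cluster chain. */

  #include <stdint.h>

  #define FAT_EOC 0x0FFFFFF8u     /* entries >= this value mark end of chain */

  /* fat[] is the file allocation table loaded into memory; entry i holds
   * the number of the cluster that follows cluster i in the same file. */
  uint32_t next_cluster(const uint32_t *fat, uint32_t cluster)
  {
      return fat[cluster] & 0x0FFFFFFFu;   /* only 28 bits number clusters */
  }

  void read_cluster(uint32_t cluster, void *buf);   /* assumed device read */

  /* Read a whole file by starting at its first cluster (stored in the
   * directory entry) and following the chain until the end marker. */
  void read_file(const uint32_t *fat, uint32_t first_cluster, void *buf,
                 uint32_t cluster_size)
  {
      uint32_t c = first_cluster;
      uint8_t *out = buf;
      while (c < FAT_EOC) {
          read_cluster(c, out);
          out += cluster_size;
          c = next_cluster(fat, c);
      }
  }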

Ext2

The ext2 file system (second extended file system) was designed after UFS (the Unix File System) and attempts to mimic certain functionalities of UFS while removing unnecessary ones. Ext2 organizes the storage space into blocks, which are then grouped into block groups (similar to the cylinder groups in UFS). A superblock contains basic information such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group; it also records the total number of inodes and the number of inodes per block group. [2.3c] Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: file type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data. Just as FAT32 keeps a duplicate copy of the file allocation table in case of crashes, ext2 places the superblock, together with the list of group descriptors (each block group has a group descriptor that maps out where files are in that group), at the start of the file system, and backup copies of the superblock and group descriptors exist throughout the system in case the primary copies are damaged. These backup copies are used when the system had an unclean shutdown and requires the use of "fsck" (the file system checker), which traverses the inodes and directories to repair any inconsistencies. [2.3d]
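
A reduced sketch of an ext2-style inode and the direct-block lookup it enables; field names and sizes are simplified from the real on-disk structure (which has 12 direct pointers plus single, double and triple indirect pointers).

  /* Reduced sketch of an ext2-style inode: file metadata plus pointers
   * to the data blocks. */

  #include <stdint.h>

  #define NDIR_BLOCKS 12

  struct inode_sketch {
      uint16_t mode;                     /* file type and access rights */
      uint16_t uid;                      /* owner */
      uint32_t size;                     /* file size in bytes */
      uint32_t atime, ctime, mtime;      /* timestamps */
      uint32_t block[NDIR_BLOCKS];       /* direct pointers to data blocks */
      uint32_t indirect;                 /* block full of further pointers */
      uint32_t double_indirect;
      uint32_t triple_indirect;
  };

  /* Map a byte offset in the file to the data block that holds it
   * (direct blocks only, for brevity). */
  uint32_t offset_to_block(const struct inode_sketch *ino,
                           uint32_t offset, uint32_t block_size)
  {
      uint32_t n = offset / block_size;
      return (n < NDIR_BLOCKS) ? ino->block[n] : 0;  /* 0 = needs indirect lookup */
  }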

Comparison

When observing how storage devices are managed by different file systems, one notices that FAT32 has a maximum volume size of 2 TB (8 TB with 32 KB clusters, 16 TB with 64 KB clusters), ext2 supports up to 32 TB, and ZFS can address 2^58 ZB (zettabytes), where each ZB is 2^70 bytes. "ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process" [2.3e]; because of this, fsck is not used in ZFS, whereas it is in ext2. Not having to check for inconsistencies after a crash saves ZFS the time and resources of systematically scanning the whole storage device. ZFS also uses a built-in volume manager that can host many file systems on one pool of devices, whereas a FAT32 file system can only manage the limited storage space on a single device; managing another storage device requires creating another FAT32 file system.

Current File Systems

NTFS

New Technology File System (NTFS) was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes which are broken down into clusters, much like FAT32. A volume contains several components: an NTFS boot sector, a Master File Table (MFT), file system data, and a copy of the Master File Table. The boot sector holds the information that tells the BIOS the layout of the volume and the file system structure. The Master File Table holds the metadata for all of the files in the volume, while the file system data area stores all data not included in the MFT. Finally, the MFT copy is a duplicate of the Master File Table, ensuring that if there is an error in the MFT the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, of which the MFT itself is a part; every file in a volume has a record created for it in the MFT. NTFS has several advantages over other file systems: recoverability, reliability, compression, and security. NTFS implements a change journal to allow the volume to be recovered if there are errors. The change journal is another record kept by the file system, recording all changes made to files and directories; it also improves reliability, since it can be used to correct errors in the volume. NTFS supports compression: files can be compressed to decrease the amount of space they require in the volume. Security was also taken into account: stored within the metadata in the MFT are permissions for each individual file, which allow only users with the correct permissions to access that file. NTFS is a 64-bit file system, allowing for 2^64 bytes of storage, but it is capped at a maximum file size of 16 TB and a maximum volume size of 256 TB.
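
Conceptually, the MFT can be pictured as an array of fixed-size file records indexed by file number; the sketch below is schematic and does not reproduce NTFS's actual on-disk layout.

  /* Conceptual sketch only: one fixed-size MFT record per file, holding
   * that file's attributes. */

  #include <stdint.h>

  #define MFT_RECORD_SIZE 1024   /* commonly 1 KB per record */

  struct mft_record_sketch {
      char     signature[4];               /* "FILE" for an in-use record */
      uint16_t sequence;                   /* bumped each time the record is reused */
      uint8_t  data[MFT_RECORD_SIZE - 6];  /* attributes: names, times, security, data runs */
  };

  /* Every file is identified by its record number; finding its metadata
   * is an index into the MFT rather than a directory scan. */
  struct mft_record_sketch *
  mft_lookup(struct mft_record_sketch *mft, uint64_t file_record_number)
  {
      return &mft[file_record_number];
  }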

ext4

Comparison

Future File Systems

BTRFS

BTRFS (the B-tree File System) is often compared to ZFS because it offers very similar functionality, even though much of the implementation is different. BTRFS is based on the b-tree structure: a subvolume is a named b-tree made up of the files and directories it stores. (Source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf)
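
A generic sketch of the idea of storing file-system items as keyed records in b-trees, with a subvolume as a named tree root; this is not BTRFS's actual item or key format.

  /* Generic sketch: every file-system item (inode, directory entry,
   * extent) is a keyed record in a b-tree; a subvolume is a named root. */

  #include <stdint.h>

  typedef struct fs_key {
      uint64_t object_id;    /* which file or directory the item belongs to */
      uint8_t  item_type;    /* inode, directory entry, extent, ... */
      uint64_t offset;       /* meaning depends on the item type */
  } fs_key_t;

  struct btree_node {
      int                 nitems;
      fs_key_t           *keys;        /* sorted, enabling binary search */
      struct btree_node **children;    /* NULL at the leaf level */
      void              **items;       /* item payloads at the leaf level */
  };

  struct subvolume {
      char               name[64];     /* a subvolume is a named tree */
      struct btree_node *root;
  };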

WinFS

References

  • Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. (2008). Demystifying Data Deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01-05, 2008). Companion '08. ACM, New York, NY, 12-17.
  • Andrew Li, Department of Computing, Macquarie University. Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System. [Z3]
  • 2.3a - Tanenbaum, A. S. (2008). Modern Operating Systems. Prentice Hall. Sec. 1.3.3.
  • 2.3b - Dr. William F. Heybruck (August 2003). An Introduction to FAT 16/FAT 32 File Systems. [1]
  • 2.3c - Raymond Chen (2008). Windows Confidential - A Brief and Incomplete History of FAT32. [2]
  • 2.3d - Carrier, B. (2005). File System Forensic Analysis. Addison-Wesley Professional. p. 287. [3]
  • 2.3e - Brokken, F., & Kubat, K. (1995). Proceedings of the First Dutch International Symposium on Linux, Amsterdam, December 8th and 9th, 1994. [4]
  • 2.3f - ZFS FAQ, OpenSolaris. [5]
  • Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems. The Zettabyte File System. [Z1]
  • Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems. [Z2]