COMP 3000 Essay 1 2010 Question 9: Difference between revisions

From Soma-notes
Azemanci (talk | contribs)
Revision as of 20:15, 14 October 2010

Question

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

Answer

Introduction

ZFS was developed by Sun Microsystems to provide the functionality of both a file system and a volume manager. Among the motivations behind its development were modularity and simplicity, immense scalability (ZFS is a 128-bit file system), and ease of administration. The designers were also keen to avoid some of the pitfalls of traditional file systems: possible data corruption (especially silent corruption), the inability to expand and shrink storage dynamically, the inability to fix bad blocks automatically, and a lack of adequate abstractions and simple interfaces. [Z2]

ZFS

ZFS differs from major traditional file systems in various ways. Some of the primary ones are modularity, virtualization of storage, and the ability to self-repair. A brief look at ZFS' various components will help illustrate those differences.

The following subsystems make up ZFS [Z3. P2].

# SPA (Storage Pool Allocator).
# DSL (Data Set and snapshot Layer).	
# DMU (Data Management Unit).
# ZAP (ZFS Attributes Processor).
# ZPL (ZFS POSIX Layer).
# ZIL (ZFS Intent Log).
# ZVOL (ZFS Volume).

The ways in which these components deliver the aforementioned characteristics are illustrated next. Modularity is achieved in the same way as in any non-trivial software system, i.e. via the division of responsibilities across various modules (in this case, seven modules). Each of these modules provides a specific functionality; as a consequence, the entire system becomes simpler and easier to maintain.

Storage virtualization is simplified by the removal of the separate volume manager, a component commonly paired with traditional file systems. The main issue with a volume manager is that it does not abstract the underlying physical storage enough. Physical blocks are abstracted as logical ones, yet a contiguous sequence of blocks is still needed when creating a file system. That leads to issues such as the need to partition the disk (or disks), and the fact that once storage is assigned to a particular file system it cannot be shared with other file systems, even when it is not used.

ZFS instead uses the idea of pooled storage. A storage pool abstracts all available storage and is managed by the SPA. The SPA can be thought of as a collection of APIs to allocate and free blocks of storage, using the blocks' DVAs (data virtual addresses). It behaves like malloc() and free(), except that physical storage, rather than memory, is allocated and freed. The main point is that all the details of the storage are abstracted from the caller. Because blocks are referred to by virtual addresses, storage can be added to or removed from the pool dynamically. Since the SPA uses 128-bit block addresses, it will take a very long time before that technology encounters limitations. Even then, the SPA module can be replaced, with the remaining modules of ZFS intact. To facilitate the SPA's work, ZFS enlists the help of the DMU and implements the idea of virtual devices.
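The malloc()/free() analogy can be made concrete with a toy allocator. The Python sketch below is illustrative only: the class name StoragePool, the (device, offset) form of a DVA, and the method names are assumptions made for this example and are not the real SPA interface.

```python
class StoragePool:
    """Toy pooled-storage allocator; hypothetical, not the real SPA API."""

    def __init__(self):
        self.free_blocks = []      # list of (device_id, block_offset) pairs
        self.allocated = set()     # DVAs currently handed out
        self.device_count = 0

    def add_device(self, num_blocks):
        """Grow the pool at run time; callers never see the device itself."""
        device_id = self.device_count
        self.device_count += 1
        self.free_blocks.extend((device_id, off) for off in range(num_blocks))
        return device_id

    def alloc(self):
        """Like malloc(): return the DVA of some free block."""
        if not self.free_blocks:
            raise MemoryError("storage pool exhausted")
        dva = self.free_blocks.pop(0)
        self.allocated.add(dva)
        return dva

    def free(self, dva):
        """Like free(): return a block to the shared pool."""
        self.allocated.remove(dva)
        self.free_blocks.append(dva)


pool = StoragePool()
pool.add_device(2)            # a small first disk
a, b = pool.alloc(), pool.alloc()
pool.add_device(4)            # storage added dynamically, no repartitioning
c = pool.alloc()              # served from the new device
pool.free(a)                  # the block is available to any caller again
```

The caller only ever handles opaque addresses, which is why a device can be added between two allocations without the caller noticing.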

Virtual devices (vdevs) abstract device drivers. A vdev can be thought of as a node with possible children. Each child can be another virtual device (i.e. a vdev) or a device driver. The SPA also handles traditional volume manager tasks, such as mirroring, and it accomplishes them through vdevs. Each vdev implements a specific task; if the SPA needs to handle mirroring, for example, a vdev is written to handle mirroring. Adding new functionality is therefore straightforward, given the clear separation of modules and the use of interfaces.
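As a rough illustration of the node-with-children idea, the sketch below builds a two-way mirror out of two leaf vdevs. The class names DiskVdev and MirrorVdev and the read/write interface are hypothetical and chosen only for this example; real vdevs talk to device drivers rather than in-memory dictionaries.

```python
class DiskVdev:
    """Leaf vdev standing in for a device driver (in-memory for this sketch)."""

    def __init__(self):
        self.blocks = {}

    def write(self, offset, data):
        self.blocks[offset] = data

    def read(self, offset):
        return self.blocks[offset]


class MirrorVdev:
    """Interior vdev: every write goes to all children, reads try each child."""

    def __init__(self, children):
        self.children = children

    def write(self, offset, data):
        for child in self.children:
            child.write(offset, data)

    def read(self, offset):
        for child in self.children:
            try:
                return child.read(offset)
            except KeyError:       # this child lost the block; try the next
                continue
        raise KeyError(offset)


disk_a, disk_b = DiskVdev(), DiskVdev()
mirror = MirrorVdev([disk_a, disk_b])
mirror.write(7, b"hello")
del disk_a.blocks[7]               # simulate losing the block on one disk
recovered = mirror.read(7)         # still served by the other child
```

Because MirrorVdev exposes the same read/write interface as its children, a mirror could itself be a child of another vdev, which is the tree structure described above.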

The DMU (Data Management Unit) accepts blocks as input from the SPA and produces objects. In ZFS, files and directories are viewed as objects. An object in ZFS is labelled with a 64-bit number and can hold up to 2^64 bytes of information. [Z1. P8].

ZFS uses the idea of a dnode. A dnode is a data structure that stores information about the blocks that belong to each object. In other words, it provides a lower-level abstraction so that a collection of one or more blocks can be treated as an object. A collection of objects, referred to as an object set, is in turn used to describe the file system. In essence, then, a file system in ZFS is a collection of object sets, each of which is a collection of objects, each of which is in turn a collection of blocks. Such levels of abstraction increase ZFS' flexibility and simplify the management of a file system. [Z3 P2].

TO-DO:

  ZPL and the common interface 

TO-DO: How ZFS maintains Data integrity and accomplishes self healing ?

copy-on-write checksumming Use of Transactions

TO-DO : Not finished yet --Tawfic

In ZFS, the ideas of files and directories are replaced by objects.

Data Integrity

At the lowest level, ZFS uses checksums for every block of data that is written to disk. The checksum is checked whenever data is read to ensure that data has not been corrupted in some way. The idea is that if either the block or the checksum is corrupted, then recalculating the checksum for the block will result in a mismatch between the calculated and stored checksums. It is possible that both the block and checksum record could be corrupted, but the probability of the corruption being such that the corrupted block's checksum matches the corrupted checksum is exceptionally low.
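The verify-on-read idea can be sketched in a few lines. SHA-256 from Python's hashlib is used here purely for convenience, and the function names write_block and read_block are invented for this example; this is not the ZFS on-disk layout.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def write_block(data: bytes):
    """Return the (data, checksum) pair that would be stored."""
    return data, checksum(data)


def read_block(stored_data: bytes, stored_checksum: bytes) -> bytes:
    """Recompute the checksum on every read; refuse to return corrupt data."""
    if checksum(stored_data) != stored_checksum:
        raise IOError("checksum mismatch: block or checksum is corrupt")
    return stored_data


data, cksum = write_block(b"important data")
ok = read_block(data, cksum)

corrupted = b"important dat\x00"       # a single damaged byte
try:
    read_block(corrupted, cksum)
    detected = False
except IOError:
    detected = True
```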

In the event that a bad checksum is found, replication of data in the form of "Ditto Blocks" provides an opportunity for recovery. A block pointer in ZFS is actually capable of pointing to multiple blocks, each of which contains duplicate data. By default, duplicate blocks are only stored for file system metadata, but this can be expanded to user data blocks as well. When a bad checksum is read, ZFS is able to follow one of the other pointers in the block pointer to hopefully find a healthy block.
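A minimal sketch of that recovery path follows, assuming a block pointer that carries one checksum and a list of copy addresses. The BlockPointer class, the dictionary standing in for the disk, and the choice to rewrite damaged copies immediately are all assumptions of this example.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


class BlockPointer:
    """Hypothetical block pointer: one checksum, several copy locations."""

    def __init__(self, addresses, cksum):
        self.addresses = addresses
        self.cksum = cksum


def read_with_healing(disk, bp):
    """Return the first copy whose checksum matches and repair the bad ones."""
    bad = []
    for addr in bp.addresses:
        data = disk[addr]
        if checksum(data) == bp.cksum:
            for bad_addr in bad:
                disk[bad_addr] = data      # overwrite damaged copies
            return data
        bad.append(addr)
    raise IOError("all copies failed their checksum")


disk = {10: b"metadata", 99: b"metadata"}    # two ditto copies
bp = BlockPointer([10, 99], checksum(b"metadata"))
disk[10] = b"metadXta"                       # silent corruption of one copy
result = read_with_healing(disk, bp)
```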

RAID setups are particularly well suited to ZFS, since there is already an abstraction between the physical storage and the zpools. Besides protecting from outright total disk failure, if a bad checksum is found, there is the possibility that one of the alternate disks has a healthy version. If these errors accumulate, it can signal an impending drive failure. When a drive does fail, some of our tolerance for data loss is consumed; that is, the system is operating at less than 100% redundancy (however that is defined for the system at hand). To address this, ZFS supports "hot spares", idle drives that can be brought online automatically when another drive fails so that full redundancy can be rebuilt with minimal delay, hopefully in time for the next drive failure.

With block-by-block data integrity well in hand, ZFS also employs a transactional update model to ensure that higher level data structures remain consistent. Rather than use a journal to allow for quick consistency checking in the event of a system crash, ZFS uses a copy-on-write model. New disk structures are written out in a detached state. Once these structures have been written and checked, then they are connected to the existing disk structures in one atomic write, with the structures they replace becoming disconnected.
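The write-detached-then-connect sequence can be illustrated with a toy store in which no block is ever overwritten. The CowStore class and its single-level root map are simplifications invented for this sketch; a real file system has a deeper tree, but the final step is the same single pointer switch.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}              # address -> immutable contents
        self.next_addr = 0
        self.root = self._write({})   # root maps file name -> data address

    def _write(self, contents):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def update(self, name, data):
        # 1. Write the new data block in a detached state.
        data_addr = self._write(data)
        # 2. Write a new copy of the root that points at it.
        new_root = dict(self.blocks[self.root])
        new_root[name] = data_addr
        new_root_addr = self._write(new_root)
        # 3. Single atomic step: switch the root pointer.
        old_root, self.root = self.root, new_root_addr
        return old_root

    def read(self, name, root=None):
        root_map = self.blocks[self.root if root is None else root]
        return self.blocks[root_map[name]]


store = CowStore()
store.update("a.txt", b"v1")
old_root = store.update("a.txt", b"v2")   # the previous root is still intact
```

A crash before step 3 leaves the old root in place and the new blocks unreachable, so the on-disk state is always one consistent version or the other. Holding on to the old root address also shows, in miniature, why point-in-time views are cheap in a copy-on-write design.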

At the user level, ZFS supports file-system snapshots. Essentially, a clone of the entire file system at a certain point in time is created. In the event of accidental file deletion, a user can access an older version out of a recent snapshot.

Data Deduplication

Data Deduplication is a method of interfile storage compression, based around the idea of storing any one block of unique data only once physically, and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data Deduplication schemes are typically implemented using hash tables, and can be applied to whole files, sub-files (blocks), or as a patch set. There is an inherent trade-off between the granularity of your deduplication algorithm and the resources needed to implement it. In general, as you consider smaller blocks of data for deduplication, you increase your "fold factor", that is, the difference between the logical storage provided vs. the physical storage needed. At the same time, however, smaller blocks mean more hash table overhead and more CPU time needed for deduplication and for reconstruction.
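The granularity trade-off can be seen in a small block-level sketch. The function names, the use of SHA-256 as the block hash, and the decision to express the fold factor as a ratio of logical to physical bytes are assumptions of this example.

```python
import hashlib


def dedup_store(files, block_size):
    """Store each unique block once; return (block table, per-file recipes)."""
    table = {}       # hash -> block contents (the deduplication hash table)
    recipes = {}     # file name -> list of block hashes
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            table.setdefault(digest, block)
            hashes.append(digest)
        recipes[name] = hashes
    return table, recipes


def rebuild(name, table, recipes):
    """Reconstruct a file by following its list of block hashes."""
    return b"".join(table[h] for h in recipes[name])


def fold_factor(files, table):
    logical = sum(len(d) for d in files.values())
    physical = sum(len(b) for b in table.values())
    return logical / physical


files = {"a": b"AAAABBBBCCCC", "b": b"AAAABBBBDDDD"}
small_table, small_recipes = dedup_store(files, block_size=4)
large_table, large_recipes = dedup_store(files, block_size=12)
```

With 4-byte blocks the two files share two blocks and the table holds four entries; with 12-byte blocks nothing is shared, but the table holds only two entries. That is the trade-off between fold factor and hash table overhead described above.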

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that the file is analyzed as it arrives at the storage server, and written to disk in its already compressed state. While this method requires the least overall storage capacity, resource constraints of the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way). A background process analyzes these files at a later time to perform the compression. This method requires more disk I/O overall, which can be a problem if the disk (or disk array) is already at I/O capacity.

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

Legacy File Systems

Files exist on storage devices such as hard disks and flash memory, and saving files onto these devices requires an abstraction that organizes how the files will be stored and later retrieved. That abstraction is a file system; FAT32 is one such file system and ext2 is another.

FAT32

Storage devices are divided into sectors (usually 512 bytes). Originally, a file's data was stored directly in sectors, with larger files spanning multiple sectors. To retrieve a file, the file system must record which sectors hold that file's data. Because a sector is small compared to many files, tracking every sector individually would take significant time and memory. To reduce this overhead, the FAT file system groups sectors into clusters, and each cluster belongs to at most one file. The drawback of clusters is that a file smaller than a cluster still occupies the whole cluster, and no other file can use the unused sectors in that cluster.

FAT stands for File Allocation Table, the table that contains an entry for each cluster on the storage device along with its properties. The FAT is designed as a linked list data structure in which each node holds a cluster’s information. “For the FAT, the device directory contains the name, size of the file and the number of the first cluster allocated to that file. The entry in the table for that first cluster of that particular file contains the number of the second cluster in that file. This continues until the last cluster entry for that file which will contain all F’s indicating it is used and the last cluster of the file. The first file on a new device will use all sequential clusters. Hence the first cluster will point to the second, which will point to the third and so on.” #2.3a The number in the name of a FAT variant, as in FAT32, means that the file allocation table is an array of 32-bit values. #2.3b Of the 32 bits, 28 are used to number the clusters on the storage device, so 2^28 clusters are available.

Larger clusters waste more space when files are much smaller than the cluster size. When a file is accessed, the file system must find all the clusters that make up the file, and this takes a long time if the clusters are not stored close together. When files are deleted, their clusters become available for new data; as a result, some files end up with their clusters scattered across the storage device and take longer to access. FAT32 does not include a defragmentation system, but all recent Windows operating systems come with a defragmentation tool. Defragmenting reorganizes the fragments of a file (its clusters) so that they reside near each other, which reduces the time it takes to access the file. Since reorganization is not a built-in function of FAT32, finding empty space for a new file requires a linear search through all the clusters; this slowness is one of the drawbacks of FAT32.

The start of every FAT32 volume contains information about the operating system and the root directory, followed by (normally) two copies of the file allocation table, so that if an update to the file system is interrupted, the secondary FAT can be used to recover the files.
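The linked-list behaviour of the table can be sketched as follows. The dictionary standing in for the FAT and the cluster numbers are made up, and a single end-of-chain value is used for simplicity (real FAT32 treats a small range of values as end-of-chain markers).

```python
END_OF_CHAIN = 0x0FFFFFFF      # "all F's" in the 28 usable bits of an entry


def cluster_chain(fat, first_cluster):
    """Follow the linked list stored in the FAT from a file's first cluster."""
    chain = []
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster] & 0x0FFFFFFF   # only 28 of the 32 bits are used
    return chain


# Hypothetical table: key = cluster number, value = next cluster in the file.
fat = {
    2: 3, 3: 4, 4: END_OF_CHAIN,      # first file: sequential clusters
    5: 9, 9: 6, 6: END_OF_CHAIN,      # fragmented file: clusters out of order
}
sequential = cluster_chain(fat, 2)
fragmented = cluster_chain(fat, 5)
```

The second chain visits clusters 5, 9 and 6 in that order, which on a spinning disk means extra seeks; this is the cost that defragmentation removes.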

Ext2

The ext2 file system (second extended file system) was modelled on UFS (the Unix File System); it mimics certain UFS functionality while removing what was considered unnecessary. Ext2 organizes the storage space into blocks, which are then separated into block groups (similar to the cylinder groups in UFS). The superblock is a block that contains basic information, such as the block size, the total number of blocks, the number of blocks per block group, and the number of reserved blocks before the first block group. The superblock also contains the total number of inodes and the number of inodes per block group. #2.3c Files in ext2 are represented by inodes. An inode is a structure that contains the description of the file: its type, access rights, owners, timestamps, size, and pointers to the data blocks that hold the file's data.

In FAT32, the file allocation table defines how file fragments are organized, and duplicate copies of the FAT are vital in case of crashes. Similarly, the first block in ext2 is the superblock, and it is followed by the list of group descriptors (each block group has a group descriptor that maps out where files are in the group). Backup copies of the superblock and group descriptors exist throughout the file system in case the primary copies are damaged. These backup copies are used when the system has had an unclean shutdown and requires “fsck” (the file system checker), which traverses the inodes and directories to repair any inconsistencies. #2.3d
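To show how the per-group counts in the superblock are used, the sketch below maps an inode number to its block group. The Superblock class is a hypothetical subset of the fields listed above and its values are made up; the arithmetic relies on ext2 inode numbers starting at 1.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Subset of the superblock fields named above (values are made up)."""
    block_size: int
    total_blocks: int
    blocks_per_group: int
    total_inodes: int
    inodes_per_group: int


def locate_inode(sb, inode_number):
    """Return (block group, index within that group's inode table).

    Inode numbers start at 1, hence the subtraction.
    """
    if not 1 <= inode_number <= sb.total_inodes:
        raise ValueError("inode number out of range")
    index = inode_number - 1
    return index // sb.inodes_per_group, index % sb.inodes_per_group


sb = Superblock(block_size=1024, total_blocks=32768, blocks_per_group=8192,
                total_inodes=8192, inodes_per_group=2048)
first = locate_inode(sb, 1)
later = locate_inode(sb, 2049)
```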

Comparison

Comparing how these file systems manage storage devices, FAT32 has a maximum volume size of 2TB (8TB with 32KB clusters, 16TB with 64KB clusters), ext2 has a maximum of 32TB, and ZFS allows 2^58 ZB (zettabytes), where each ZB is 2^70 bytes, which is vastly larger. “ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process” #2.3e; because of this, ZFS does not need fsck, whereas ext2 does. Not needing an offline consistency check saves ZFS the time and resources of systematically traversing a storage device while it is out of service. FAT32 and ext2 each manage only a single storage device, whereas ZFS integrates volume management and can control multiple storage devices, virtual or physical. Managing multiple storage devices under one file system means that storage is shared throughout the system, so free space on any device is available to all data stored in ZFS.

Current File Systems

NTFS

The New Technology File System, also known as NTFS, was first introduced with Windows NT and is currently used on all modern Microsoft operating systems. NTFS creates volumes, which are then broken down into clusters, much like the FAT32 file system. A volume contains an NTFS Boot Sector, a Master File Table (MFT), File System Data, and a Master File Table Copy. The NTFS boot sector holds the information that communicates the layout of the volume and the file system structure to the BIOS. The Master File Table holds the metadata for all the files in the volume. The File System Data stores all data that is not included in the Master File Table. Finally, the Master File Table Copy is a copy of the Master File Table. [1] Having a copy of the MFT ensures that if there is an error in the primary MFT, the file system can still be recovered. The MFT keeps track of all file attributes in a relational database, and the MFT itself is part of this database. Every file in a volume has a record created for it in the MFT.

NTFS is a journaling file system, which means it uses a journal to ensure data integrity. The file system enters changes into the journal before they are made, in case there is an interruption while those changes are being written to disk. A feature specific to NTFS is the change journal, which records all changes made to the file system; this helps performance because the entire volume does not have to be scanned to find changes. [2] NTFS also allows compression of files to save disk space, but this can affect performance: in order to move compressed files, they must first be decompressed, then transferred, and then recompressed. NTFS does have certain volume and size constraints. [3] NTFS is a 64-bit file system, which allows for 2^64 bytes of storage. However, NTFS is capped at a maximum file size of 16TB and a maximum volume size of 256TB. [1]
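The journal-before-write idea described above can be sketched as a toy write-ahead log. The JournaledDisk class, its record format and the simulated crash are inventions of this example and do not reflect the NTFS on-disk format.

```python
class JournaledDisk:
    """Toy write-ahead journal; not the NTFS on-disk format."""

    def __init__(self):
        self.data = {}        # the "real" on-disk structures
        self.journal = []     # records written before data is touched

    def begin(self, changes):
        """Step 1: record the intended changes in the journal."""
        self.journal.append({"changes": dict(changes), "applied": False})
        return len(self.journal) - 1

    def apply(self, txn, crash_after=None):
        """Step 2: apply changes; optionally simulate a crash part-way."""
        changes = self.journal[txn]["changes"]
        for count, (key, value) in enumerate(changes.items()):
            if crash_after is not None and count >= crash_after:
                return False          # power lost mid-update
            self.data[key] = value
        self.journal[txn]["applied"] = True
        return True

    def recover(self):
        """After a crash, replay every journal record not fully applied."""
        for record in self.journal:
            if not record["applied"]:
                self.data.update(record["changes"])
                record["applied"] = True


disk = JournaledDisk()
txn = disk.begin({"size": 42, "mtime": 1000})
disk.apply(txn, crash_after=1)     # only "size" reaches the disk
half_done = dict(disk.data)
disk.recover()                     # replay completes the update
```

Recovery only has to read the journal, not scan the whole volume, which is why journaling shortens the check after an unclean shutdown.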

ext4

The Fourth Extended File System, also known as ext4, is a Linux file system. Ext4 also uses volumes like NTFS but does not use clusters. It was designed to allow for greater scalability than ext3. Ext4 uses extents; an extent is a descriptor representing a range of contiguous physical blocks. [4] Extents represent the data that is stored in the volume, and they allow for better performance than ext3 when handling large files. Ext4 is also a journaling file system: it records changes in a journal before making them, in case there is an interruption while writing to the disk. To ensure data integrity, ext4 uses checksumming; a checksum has been implemented in the journal due to the high importance of the data stored there. [4] Ext4 does not support compression, so there are no slowdowns from compressing and decompressing when moving data. Ext4 uses 48-bit physical block numbers rather than the 32-bit block numbers used by ext3. The 48-bit block numbers increase the maximum volume size to 1EB, up from the 16TB maximum of ext3. [4]
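The 16TB and 1EB figures follow from the width of the block number. The short calculation below assumes a 4 KiB block size, which is not stated in the text above, and uses binary units (TiB, EiB).

```python
def max_volume_bytes(block_number_bits, block_size_bytes):
    """Largest volume addressable with the given block-number width."""
    return (2 ** block_number_bits) * block_size_bytes


BLOCK_SIZE = 4096                 # assumed 4 KiB blocks; not stated in the text
TIB = 2 ** 40
EIB = 2 ** 60

ext3_limit = max_volume_bytes(32, BLOCK_SIZE)
ext4_limit = max_volume_bytes(48, BLOCK_SIZE)

print(ext3_limit // TIB, "TiB")   # 16 TiB
print(ext4_limit // EIB, "EiB")   # 1 EiB
```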

Comparison

The most noticeable difference when comparing ZFS to other current file systems is size. NTFS allows a maximum volume of 256TB and ext4 allows 1EB. ZFS allows a maximum file system of 16EB, which is 16 times more than the current ext4 Linux file system. Given the amount of storage available to the current file systems, it is clear that ZFS is better suited to servers. ZFS also has the ability to self-heal, which neither of the two current file systems has. This improves performance, as there is no need for downtime to scan the disk for errors.

Future File Systems

BTRFS

--posted by [Naseido] -- just starting a rough draft for an intro to B-trees --source: http://www.dhtusa.com/media/IOPerf_CMG09DHT.pdf ( found through Google Scholar)

BTRFS, the B-tree File System, is a file system that is often compared to ZFS because it has very similar functionality, even though much of the implementation is different. BTRFS is based on the B-tree structure, where a subvolume is a named B-tree made up of the files and directories stored.

WinFS

References

  • Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. 2008. Demystifying data deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 01 - 05, 2008). Companion '08. ACM, New York, NY, 12-17.
  • Andrew Li, Department of Computing Macquarie University, Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System [Z3]
  • 2.3a - Tanenbaum, A. S. (2008). Modern operating systems. Prentice Hall. Sec: 1.3.3
  • 2.3b - Dr.William F. Heybruck.(August 2003). An Introduction to FAT 16/FAT 32 File Systems. [1].
  • 2.3c - Raymond Chen. Windows Confidential -A Brief and Incomplete History of FAT32.(2008) [2].
  • 2.3d - Carrier, B. (2005). File system forensic analysis. Addison-Wesley Professional. P.287 [3].
  • 2.3e - Brokken, F, & Kubat, K. (1995). Proceedings to the first dutch international symposium on linux, amsterdam, december 8th and 9th, 1994. [4].
  • 2.3f - ZFS FAQ - opensolaris [5].
  • Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, Mark Shellenbaum. Sun Microsystems, The Zettabyte File System [Z1]
  • Romik Guha Anjoy, Soumya Kanti Chakraborty, Malardalen University, Sweden. Feature Based Comparison of Modern File Systems [Z2]

[1] http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx

[2] http://technet.microsoft.com/en-us/library/cc938919.aspx

[3] http://support.microsoft.com/kb/251186

[4] http://www.kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf