COMP 3000 Essay 1 2010 Question 9
Question
What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)
Answer
ZFS was developed by Sun Microsystems (now owned by Oracle) as a server-class file system. This differs from most file systems, which were developed as desktop file systems that could also be used by servers. With servers as the target, particular attention was paid to data integrity, capacity, and speed.
One of the most significant ways in which ZFS differs from traditional file systems is its level of abstraction. While a traditional file system abstracts away the physical properties of the media on which it lives (hard disk, flash drive, CD-ROM, etc.), ZFS also abstracts away whether the file system lives on one piece of hardware or many. Examples include a single hard drive, an array of hard drives, or a number of hard drives spread across non-co-located systems.
One mechanism that enables this abstraction is that the volume manager, which is normally a program separate from the file system, is built directly into ZFS.
ZFS is a 128-bit file system, which allows it to address 2^128 bytes of storage.
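As a rough, back-of-the-envelope illustration of what a 128-bit address space means (this says nothing about the practical limits of any real pool), the theoretical maximum works out to roughly 3.4 x 10^38 bytes, or on the order of 10^17 zettabytes:

# Back-of-the-envelope: how much storage a 128-bit address space could cover.
ZETTABYTE = 10 ** 21          # 1 ZB in bytes (decimal definition)

max_bytes = 2 ** 128          # theoretical limit of a 128-bit file system
print(f"{max_bytes:.3e} bytes")              # ~3.403e+38 bytes
print(f"{max_bytes / ZETTABYTE:.3e} ZB")     # ~3.403e+17 zettabytes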
Major Features of ZFS
Physical Layer Abstraction
- volume management and the file system all in one
- file systems sit on top of zpools, which sit on top of vdevs, which sit on top of physical devices (see the sketch after this list)
- file systems easily, and often do, span many physical devices
- enormous capacity (128-bit addressing)
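The layering in the list above can be pictured with a toy model. This is only an illustrative sketch of the concept; the class names Vdev, Zpool and Dataset are invented for the example and are not ZFS's real data structures. The point is that physical devices are grouped into vdevs, vdevs are pooled, and any number of file systems draw space from the shared pool rather than from a fixed partition.

# Toy model of a ZFS-style storage stack: devices -> vdevs -> pool -> file systems.
# Names and structure are illustrative only, not ZFS internals.

class Vdev:
    """A virtual device grouping one or more physical devices (e.g. a mirror)."""
    def __init__(self, *device_sizes):
        self.capacity = min(device_sizes) if device_sizes else 0  # mirror: size of smallest disk

class Zpool:
    """A storage pool aggregating vdevs; all file systems share its free space."""
    def __init__(self, *vdevs):
        self.capacity = sum(v.capacity for v in vdevs)
        self.used = 0

    def allocate(self, nbytes):
        if self.used + nbytes > self.capacity:
            raise IOError("pool out of space")
        self.used += nbytes

class Dataset:
    """A file system that draws blocks from the pool instead of a fixed partition."""
    def __init__(self, pool):
        self.pool = pool

    def write(self, nbytes):
        self.pool.allocate(nbytes)   # no per-file-system size was ever declared

# Two mirrored pairs of 1 TB disks -> one 2 TB pool shared by two file systems.
TB = 10 ** 12
pool = Zpool(Vdev(TB, TB), Vdev(TB, TB))
home, var = Dataset(pool), Dataset(pool)
home.write(500 * 10 ** 9)
var.write(300 * 10 ** 9)
print(pool.used, "of", pool.capacity, "bytes used")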
Data Integrity
- checksums
- self-monitoring/self-healing using mirroring and copy-on-write (see the sketch after this list)
- transaction-based file I/O
- system snapshots
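A minimal sketch of the checksum-and-repair idea behind the self-healing bullet above. This is not ZFS's actual code path (ZFS stores checksums in parent block pointers and uses its own checksum algorithms); it simply shows the principle: verify each copy against a stored checksum on read, and if one mirror copy turns out to be corrupt, return the good copy and quietly rewrite the bad one.

import hashlib

def checksum(block: bytes) -> str:
    # SHA-256 stands in here for whatever checksum the file system uses.
    return hashlib.sha256(block).hexdigest()

def read_with_self_heal(mirrors, index, expected_sum):
    """Read block `index`, verifying each mirror copy against the stored checksum.

    `mirrors` is a list of writable block stores (here: plain Python lists).
    """
    for copy in mirrors:
        if checksum(copy[index]) == expected_sum:
            good = copy[index]
            # Self-heal: rewrite any mirror whose copy fails verification.
            for other in mirrors:
                if checksum(other[index]) != expected_sum:
                    other[index] = good
            return good
    raise IOError("all copies failed checksum verification")

# Demo: two mirrored copies, one silently corrupted ("bit rot").
block = b"important data"
mirror_a, mirror_b = [block], [b"imp0rtant data"]
stored = checksum(block)
assert read_with_self_heal([mirror_a, mirror_b], 0, stored) == block
assert mirror_b[0] == block   # the corrupt copy was repaired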
Data Deduplication
Data deduplication is a method of inter-file storage compression, based on the idea of storing any one block of unique data only once physically and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if the data lends itself to deduplication.
Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, to sub-file units (blocks), or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the "fold factor", that is, the ratio of logical storage provided to physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time spent on deduplication and reconstruction.
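To make the hash-table idea and the fold factor concrete, here is a minimal block-level deduplication sketch. It is invented for this essay and is not ZFS's dedup table format: each fixed-size block is keyed by its hash, only previously unseen blocks are stored, and files become lists of hash references.

import hashlib

BLOCK_SIZE = 4096   # dedup granularity; smaller blocks -> higher fold factor, more overhead

store = {}    # hash -> unique block (the deduplication table)
files = {}    # file name -> list of block hashes (logical view)

def dedup_write(name: str, data: bytes) -> None:
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        store.setdefault(h, block)   # store the block only if it is new
        refs.append(h)
    files[name] = refs

def dedup_read(name: str) -> bytes:
    return b"".join(store[h] for h in files[name])

# Two files that share most of their contents.
payload = b"A" * 3 * BLOCK_SIZE
dedup_write("report.txt", payload)
dedup_write("report_copy.txt", payload + b"B" * BLOCK_SIZE)

logical = sum(len(dedup_read(n)) for n in files)
physical = sum(len(b) for b in store.values())
print("fold factor: %.2f" % (logical / physical))   # logical vs. physical storage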
The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and is written to disk already in its compressed state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested. In particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them at a later time to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at its I/O capacity.
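The difference between the two approaches is only where the deduplication step runs. The sketch below reuses the hypothetical dedup_write helper from the previous example to show the two placements: in-band deduplication does the work at ingest time, while out-of-band deduplication writes data through untouched and lets a background pass compress it later, at the cost of extra disk I/O.

raw_store = {}   # out-of-band staging area: files land here unanalyzed

def inband_ingest(name: str, data: bytes) -> None:
    dedup_write(name, data)          # analyzed and compressed as it arrives

def outofband_ingest(name: str, data: bytes) -> None:
    raw_store[name] = data           # written "the traditional way" first

def background_dedup_pass() -> None:
    # Runs later, when the server is idle; costs extra disk I/O (read back + rewrite).
    for name, data in list(raw_store.items()):
        dedup_write(name, data)
        del raw_store[name]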
In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.
References
- Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. "Demystifying Data Deduplication." In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 1-5, 2008), Companion '08, ACM, New York, NY, pp. 12-17.
- Geer, D. "Reducing the Storage Burden via Data Deduplication." Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.
- Bonwick, J. "ZFS Deduplication." Jeff Bonwick's Blog, November 2, 2009.