COMP 3000 Essay 1 2010 Question 9

Question

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

Answer

ZFS was developed by Sun Microsystems (now owned by Oracle) as a server-class file system. This differs from most file systems, which were developed as desktop file systems that could also be used by servers. With servers as the target, particular attention was paid to data integrity, size, and speed.

One of the most significant ways in which ZFS differs from traditional file systems is its level of abstraction. While a traditional file system abstracts away the physical properties of the media upon which it lives (e.g. hard disk, flash drive, CD-ROM), ZFS also abstracts away whether the file system lives on one or on many different pieces of hardware or media. Examples include a single hard drive, an array of hard drives, or a number of hard drives on non-co-located systems.

One of the mechanisms that allows this abstraction is that the volume manager, which in traditional systems is a program separate from the file system, is moved into ZFS itself.
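To make the pooling idea concrete, here is a minimal, hypothetical sketch (the Pool and Device classes are illustrative only, not ZFS internals): storage devices are gathered into a single pool, and consumers simply ask the pool for blocks without knowing which physical device supplies them.

```python
# Hypothetical sketch of pooled storage, not ZFS code: a "pool" hands out
# blocks from several backing devices, so anything built on top never needs
# to know which physical device holds a given block.

class Device:
    def __init__(self, name, num_blocks):
        self.name = name
        self.free = list(range(num_blocks))  # free block indices on this device

class Pool:
    """Aggregates many devices into one flat allocation space."""
    def __init__(self, devices):
        self.devices = devices

    def allocate(self):
        # Pick any device with free space; callers only see (device, block) handles.
        for dev in self.devices:
            if dev.free:
                return (dev.name, dev.free.pop())
        raise RuntimeError("pool out of space")

pool = Pool([Device("disk0", 4), Device("disk1", 4)])
print([pool.allocate() for _ in range(6)])  # blocks come from whichever disk has room
```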

ZFS is a 128-bit file system, which allows it to address 2^128 bytes of storage.
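As a rough sense of scale (treating a zettabyte here as the binary 2^70 bytes), the arithmetic below shows why a 128-bit address space gives the file system its name:

```python
# Back-of-the-envelope arithmetic for a 128-bit address space.
ZETTABYTE = 2 ** 70          # 1 binary zettabyte in bytes
address_space = 2 ** 128     # bytes addressable by a 128-bit file system

print(address_space)                # 340282366920938463463374607431768211456 bytes
print(address_space // ZETTABYTE)   # 2**58, roughly 2.9 * 10**17 zettabytes
```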


Major Features of ZFS

Data Integrity

  • Checksums
  • Self-monitoring/self-healing using mirroring and copy-on-write (see the sketch after this list)
  • Transaction-based file I/O
  • System snapshots
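The following is a minimal sketch, not ZFS code, of how a stored checksum combined with mirrored copies lets a read detect a bad copy and silently repair it; the MirroredBlock class and the choice of SHA-256 are assumptions for illustration only.

```python
# Checksum-driven self-healing, sketched with assumed names (not ZFS internals):
# each block's checksum is kept alongside its pointer; on read, a corrupted copy
# is detected and repaired from the surviving mirror.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class MirroredBlock:
    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]  # two mirrored copies
        self.expected = checksum(data)                    # stored with the block pointer

    def read(self) -> bytes:
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) == self.expected:
                # Heal any copy that no longer matches the stored checksum.
                for j, other in enumerate(self.copies):
                    if j != i and checksum(bytes(other)) != self.expected:
                        self.copies[j] = bytearray(copy)
                return bytes(copy)
        raise IOError("all copies corrupt")

blk = MirroredBlock(b"important data")
blk.copies[0][0] ^= 0xFF   # simulate silent corruption on one mirror
print(blk.read())          # returns good data and repairs the bad copy
```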

Data Deduplication

Data deduplication is a method of inter-file storage compression, based on the idea of storing any one block of unique data only once physically and logically linking that block to each file that contains that data. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only if you have data that lends itself to deduplication.

Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, to sub-file blocks, or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the "fold factor", that is, the ratio of the logical storage provided to the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time needed for deduplication and for reconstruction.
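As an illustration of block-level deduplication with a hash table, the sketch below (fixed 4 KB blocks and SHA-256 are assumptions, not ZFS's actual parameters) stores 100 copies of the same 20 MB email attachment and reports the resulting fold factor:

```python
# Block-level deduplication sketch: only blocks with unseen hashes consume
# physical space; the fold factor is logical bytes over physical bytes.

import hashlib, os

BLOCK_SIZE = 4096
table = {}     # block hash -> physical block (stored once)
logical = 0    # bytes the clients think they stored
physical = 0   # bytes actually written

def store(data: bytes):
    global logical, physical
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        logical += len(block)
        key = hashlib.sha256(block).hexdigest()
        if key not in table:          # only unique blocks consume physical space
            table[key] = block
            physical += len(block)

attachment = os.urandom(20 * 1024 * 1024)   # the same 20 MB attachment...
for _ in range(100):                        # ...saved in 100 mailboxes
    store(attachment)

print(f"fold factor: {logical / physical:.0f}x")   # about 100x for this workload
```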

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and is written to disk in its already deduplicated state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must have enough memory to hold the entire deduplication hash table for fast comparisons. With out-of-band deduplication, inbound files are written to disk without any analysis (that is, in the traditional way), and a background process analyzes them later to perform the compression. This method requires higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.
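The sketch below contrasts the two approaches using hypothetical helper functions (not a real API): the in-band path consults the hash table before anything is written, while the out-of-band path writes everything and folds duplicates in a later scan.

```python
# In-band vs. out-of-band deduplication, sketched with assumed helpers.

import hashlib

def in_band_write(table, disk, block):
    key = hashlib.sha256(block).hexdigest()
    if key not in table:          # new data: one disk write
        table[key] = len(disk)
        disk.append(block)
    return table[key]             # duplicates cost no extra disk I/O

def out_of_band_scan(table, disk):
    # Runs later: re-reads everything already on disk, so total I/O is higher,
    # but the ingest path can skip hashing entirely.
    refs = []
    for i, block in enumerate(disk):
        key = hashlib.sha256(block).hexdigest()
        refs.append(table.setdefault(key, i))
    return refs

disk, table = [], {}
in_band_write(table, disk, b"a" * 4096)   # in-band: the duplicate never reaches disk
in_band_write(table, disk, b"a" * 4096)
print(len(disk))                          # 1

disk2, table2 = [b"a" * 4096, b"a" * 4096, b"b" * 4096], {}
print(out_of_band_scan(table2, disk2))    # [0, 0, 2]: duplicates folded after the fact
```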

In the case of ZFS, which is typically hosted as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and thus performs the deduplication in-band.

References

  • Mandagere, N., Zhou, P., Smith, M. A., and Uttamchandani, S. Demystifying data deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion (Leuven, Belgium, December 1-5, 2008). Companion '08. ACM, New York, NY, 12-17. http://portal.acm.org.proxy.library.carleton.ca/citation.cfm?id=1462739
  • Geer, D. "Reducing the Storage Burden via Data Deduplication." Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008. http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493
  • Bonwick, J. ZFS Deduplication. Jeff Bonwick's Blog, November 2, 2009. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup