COMP 3000 Essay 1 2010 Question 9

Question

What requirements distinguish the Zettabyte File System (ZFS) from traditional file systems? How are those requirements realized in ZFS, and how do other operating systems address those same requirements? (Please discuss legacy, current, and in-development systems.)

Answer

ZFS was developed by Sun Microsystems (now owned by Oracle) as a server-class file system. This differs from most file systems, which were developed as desktop file systems that could also be used by servers. With servers as the target, particular attention was paid to data integrity, capacity, and speed.

One of the most significant ways in which ZFS differs from traditional file systems is its level of abstraction. A traditional file system abstracts away the physical properties of the medium on which it lives (hard disk, flash drive, CD-ROM, etc.); ZFS additionally abstracts away whether the file system lives on one piece of hardware or media or on many. Examples include a single hard drive, an array of hard drives, or a number of hard drives spread across non-co-located systems.

One of the mechanisms that enables this abstraction is that the volume manager, which in traditional file systems is normally a program separate from the file system, is built directly into ZFS.

ZFS is a 128-bit file system, allowing it to address 2^128 bytes of storage.
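
To put that limit in perspective, a quick Python calculation (purely illustrative):

  # Maximum theoretically addressable storage in a 128-bit file system.
  max_bytes = 2 ** 128
  print(f"{max_bytes:.3e} bytes")          # ~3.403e+38 bytes
  # Expressed in zettabytes (10^21 bytes), the unit ZFS is named after:
  print(f"{max_bytes / 10**21:.3e} ZB")    # ~3.403e+17 zettabytes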


Major Features of ZFS

Physical Layer Abstraction

  • volume management and file system all in one
  • file systems sit on top of zpools, which sit on top of vdevs, which sit on top of physical devices (see the sketch following this list)
  • file systems easily, and often do, span many physical devices
  • enormous capacity: 2^128 bytes of addressable storage
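
A minimal Python sketch of that layering (the class names and structure here are illustrative assumptions, not ZFS's actual data structures):

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class PhysicalDevice:              # a disk, SSD, file, etc.
      name: str
      size_bytes: int

  @dataclass
  class Vdev:                        # virtual device: groups physical devices,
      devices: List[PhysicalDevice]  # e.g. as a mirror or RAID-Z stripe
      kind: str = "mirror"

  @dataclass
  class Zpool:                       # storage pool built from one or more vdevs
      name: str
      vdevs: List[Vdev]

  @dataclass
  class FileSystem:                  # file systems draw blocks from the pool on demand
      name: str
      pool: Zpool

  # One pool spanning two mirrored pairs, with two file systems sharing it.
  pool = Zpool("tank", [
      Vdev([PhysicalDevice("disk0", 2**40), PhysicalDevice("disk1", 2**40)]),
      Vdev([PhysicalDevice("disk2", 2**40), PhysicalDevice("disk3", 2**40)]),
  ])
  home = FileSystem("tank/home", pool)
  mail = FileSystem("tank/mail", pool)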


Data Integrity

  • checksums on every block of data
  • self-monitoring/self-healing using mirroring and copy-on-write (see the sketch following this list)
  • transaction-based file I/O
  • file system snapshots
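
ZFS stores each block's checksum in its parent block pointer, forming a Merkle tree, and supports several checksum algorithms; the Python sketch below captures only the verify-and-repair idea for a two-way mirror (all names are illustrative assumptions):

  import hashlib

  def checksum(data: bytes) -> str:
      return hashlib.sha256(data).hexdigest()

  class MirroredBlock:
      """One logical block stored as two physical copies plus a checksum."""
      def __init__(self, data: bytes):
          self.copies = [data, data]        # mirror: same data on two 'disks'
          self.expected = checksum(data)    # recorded at write time

      def read(self) -> bytes:
          # Verify copies on read; repair any corrupt copy from a good one.
          for copy in self.copies:
              if checksum(copy) == self.expected:
                  for j in range(len(self.copies)):
                      if checksum(self.copies[j]) != self.expected:
                          self.copies[j] = copy   # self-heal the bad mirror
                  return copy
          raise IOError("all copies corrupt: unrecoverable block")

  blk = MirroredBlock(b"important data")
  blk.copies[0] = b"bit rot!"                  # simulate silent corruption
  assert blk.read() == b"important data"       # served from the good mirror
  assert blk.copies[0] == b"important data"    # bad copy healed on read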

Data Deduplication

Data deduplication is a method of inter-file storage compression based on the idea of physically storing any one block of unique data only once, and logically linking that block to each file that contains it. Effective use of data deduplication can reduce the space and power requirements of physical storage, but only for data that lends itself to deduplication, such as backups or collections of near-identical virtual machine images.

Data deduplication schemes are typically implemented using hash tables, and can be applied to whole files, to sub-file units (blocks), or as a patch set. There is an inherent trade-off between the granularity of the deduplication algorithm and the resources needed to implement it. In general, considering smaller blocks of data for deduplication increases the "fold factor", that is, the ratio between the logical storage provided and the physical storage needed. At the same time, however, smaller blocks mean more hash-table overhead and more CPU time spent on deduplication and reconstruction.
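
The following Python sketch shows the hash-table idea at block granularity (the names and the fixed 4 KiB block size are assumptions for illustration; a production system would also track reference counts and guard against hash collisions):

  import hashlib

  BLOCK_SIZE = 4096      # fixed block size, an assumption for this sketch

  class DedupStore:
      """Content-addressed block store: each unique block is kept once."""
      def __init__(self):
          self.blocks = {}   # hash -> block data (the deduplication table)

      def write(self, data: bytes) -> list:
          """Store a byte string; return its list of block hashes (the 'file')."""
          recipe = []
          for i in range(0, len(data), BLOCK_SIZE):
              block = data[i:i + BLOCK_SIZE]
              h = hashlib.sha256(block).hexdigest()
              self.blocks.setdefault(h, block)   # new physical block only if unseen
              recipe.append(h)
          return recipe

      def read(self, recipe: list) -> bytes:
          """Reconstruct a file from its block-hash recipe."""
          return b"".join(self.blocks[h] for h in recipe)

  store = DedupStore()
  f1 = store.write(b"A" * 8192 + b"B" * 4096)   # three blocks, two unique
  f2 = store.write(b"A" * 4096)                 # duplicates an existing block
  logical = 8192 + 4096 + 4096
  physical = sum(len(blk) for blk in store.blocks.values())
  print(f"fold factor: {logical / physical:.2f}")   # 2.00 here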

The actual analysis and deduplication of incoming files can occur in-band or out-of-band. In-band deduplication means that a file is analyzed as it arrives at the storage server and is written to disk already in its deduplicated state. While this method requires the least overall storage capacity, resource constraints on the server may limit the speed at which new data can be ingested; in particular, the server must be able to hold the entire deduplication hash table in memory for fast comparisons. With out-of-band deduplication, inbound files are first written to disk without any analysis (that is, in the traditional way), and a background process analyzes and compresses them at a later time. This method incurs higher overall disk I/O, which can be a problem if the disk (or disk array) is already at I/O capacity.
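
To make the distinction concrete, here is a minimal sketch of the two approaches (illustrative names only; a real implementation would also free the duplicate blocks it folds together):

  import hashlib

  BLOCK_SIZE = 4096

  def in_band_write(dedup_table: dict, data: bytes) -> list:
      """Hash and deduplicate each block before it ever reaches disk."""
      recipe = []
      for i in range(0, len(data), BLOCK_SIZE):
          block = data[i:i + BLOCK_SIZE]
          h = hashlib.sha256(block).hexdigest()
          dedup_table.setdefault(h, block)   # only unseen blocks are 'written'
          recipe.append(h)
      return recipe

  def out_of_band_pass(raw_blocks: list) -> dict:
      """Background pass: scan blocks already on disk and fold duplicates."""
      dedup_table = {}
      for block in raw_blocks:
          h = hashlib.sha256(block).hexdigest()
          dedup_table.setdefault(h, block)   # duplicates found after the fact
      return dedup_table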

In the case of ZFS, which is typically deployed as a server-side file system, the server itself performs all of the deduplication and reconstruction; the entire process is transparent to the client. ZFS assumes that it is running on a highly multi-threaded operating system and that CPU cycles are in greater abundance than disk I/O cycles, and it therefore performs deduplication in-band.

References

  • Geer, D., "Reducing the Storage Burden via Data Deduplication," Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008. http://ieeexplore.ieee.org.proxy.library.carleton.ca/xpls/abs_all.jsp?arnumber=4712493
  • Bonwick, J., "ZFS Deduplication," Jeff Bonwick's Blog, November 2, 2009. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
  • Li, A., "Zettabyte File System Autopsy: Digital Crime Scene Investigation for Zettabyte File System," Department of Computing, Macquarie University.