David Barrera

dbarrera@ccsl.carleton.ca

PDF available at http://www.ccsl.carleton.ca/~dbarrera/distos.pdf

Abstract

This paper is a survey of existing approaches to naming and locating resources in distributed file systems. We survey proposals from the past 20 years and find that while there have been many improvements in the hardware that powers distributed file systems, there are only a few well known proposals for dealing with resource location an naming.

Introduction

The ability to name resources is important in any file system. Mapping machine readable names to human readable names allows users to forget about the way the operating system (OS) is handling file access, and focus on completing desired tasks.

In traditional file systems, users are mostly responsible for creating meaningful file hierarchies for storing and later searching for files. Users must be aware of file system restrictions (e.g., file name length, file size, etc.). The underlying file system is only in charge of moving data to or from physical storage media. Distributed file systems offer a series of advantages to users (e.g., increased storage space and data reliability), but must be designed such that end-users are not aware of all the logic and processing ocurring in the background. Indeed, a distributed file system loses its appeal if the user is required to do all the heavy lifting.

Take for example, an end-user wanting to access a PDF document. In a local file system, the user must only locate the PDF file in the file hierarchy, and retrieve from disk. In a distributed file system, the PDF file might be stored on a remote server, or perhaps stored multiple times on multiple servers. The problem then becomes how to enable end-users to locate the correct copy of a file amongst a large volume of shared data.

This paper focuses on two important aspects of distributed file systems: (1) how files are named or identified uniquely; and (2) how files are found by clients or metadata servers once they are stored in the network. We survey distributed file systems and file system designs from as early as 1989 and as recently as 2006. We find that there are a relatively small number of ways a distributed file system can approach the problem of naming and locating files, and the selected approach is always dependent on the requirements of the system.

Naming Resources

On non-distributed systems (e.g., a stand-alone desktop computer), file systems use an object's absolute path as a unique identifier for that object in the file system. This usually translates to meaning that there can't be two objects with the same name in the same location (e.g., a directory like =/home/dbarrera/files/= can't contain two files called =file1= ). In distributed file systems, there is an obvious need for allowing multiple files with the same human-readable name, and perhaps even the same absolute path (although relative to a particular client) as other clients sharing storage on the system. This section reviews methods used by existing distributed file systems to handle object naming at a massive (sometimes global) scale.

Depending on the requirements of the file system (maximum number of clients, concurrent read/writes, etc.), different approaches to naming might be taken. Some file systems such as Coda [10], aim to mimic the UNIX-like file naming. Others systems have relaxed POSIX-like behaviour to allow for better scalability and speed.

Local Naming

The Sun Network File System (NFS) specifies that each client sees a UNIX file namespace with a private root. Due to each client being free to manage its own namespace, several workstations mounting the same remote directory might not have the same view of the files contained in that directory. However, if file-sharing or location transparency is required, it can be achieved by convention (e.g., users agreeing on calling a file a specific name) rather than by design.

One of the first distributed file systems, the Apollo DOMAIN File System [6] uses 64-bit unique identifiers (UIDs) for every object in the system. Each Apollo client also has a UID created the time of its manufacture. When a new file is created, the UID for that file is derived from the time and UID of the file's workstation (this guarantees uniqueness of UIDs per fil e without a central server assigning them).

The Andrew file system [4] uses an internal 96-bit identifier for uniquely identifying files. These identifiers are used in the background to refer to files, but are never shown to users. Andrew clients see a partitioned namespace comprised of a local and shared namespace. The shared namespace is identical on all workstations, managed by a central server which can be replicated. The local namespace is typically only used for files required to boot an Andrew client, and to initialize the distributed client operation.

Cryptographic Naming

OceanStore [5] stores objects at the lowest level by identifying them with a globally unique identifier (GUID). GUIDs are convenient in distributed systems because they do not require a central authority to give them out. This allows any client on the system to autonomously generate a valid GUID with low probability of collisions (GUIDs are typically long bit strings e.g., more than 128 bits). At the same time, the benefit of an autonomous, de-centralized namespace management allows for malicious clients to hijack someone else's namespace and intentionally create collisions. To address this issue, OceanStore uses a technique proposed by Mazieres et al. [7] called self-certifying path names .

Self-certifying pathnames have all the benefits of public key cryptography without the burden of key management, which is known to be difficult, especially at a very large scale. One of the design goals of self-certifying pathnames is for clients to cryptographically verify the contents of any file on the network, without requiring exernal information. The novelty of this approach is that file names inherently contain all information necessary to communicate with remote servers. Essentially, an object's GUID is the secure hash (SHA-1 or similar) of the object's owner's key and some human readable name. By embedding a client key into the GUID, servers and other clients can verify the identity and ownership of an object without querying a third-party server.

Freenet [2] also uses keypair-based naming but in a slightly different way than OceanStore. Freenet identifies all files by a binary key which is obtained by applying a hash function. There are three types of keys in this distributed file system:

Keyword-signed key (KSK) This is the simplest identifier because it is derived from an arbitrary text string chosen by the user who is storing the file on the network. A user storing a PDF document might use the text string "freenet/distributed/file/system" to describe the file. The string is used to deterministically generate a private/public keypair. The public part of the key is hashed and becomes the file identifier.

We note that files can be recovered by guessing or bruteforcing the text string. Also, nothing stops two different users from coming up with the same descriptive string, and the second user's file would be rejected by the system, as there would be a collision in the namespace.

Signed-subspace key (SSK) This method enables personal namespaces for users. For this to work, users generate a public/private keypair using a good random number generator. The user also creates a descriptive text string, but in this case, it is XORed with the public key to generate the file key. This method allows users to manage their own namespace (i.e., collisions can still occur locally if the user picks the same string for two files). Users can also publish a list of keywords and a public key if they want to make those files publicly available.

Content-hash key (CHK) In this method, the file key is derived by hashing the contents of file. Files are also encrypted with a random encryption key specific to that file. For others to retrieve the file, the owner makes available the file hash along with the decryption key.

Hierarchical naming

Cheriton et al. [1] suggest naming objects using a long name which includes multiple pieces of information: (1) the resource's name and location on the file server where it resides; (2) the organization where that file server is located; and (3) a global administrative domain representing all the organizations participating the distributed file system. For example a file name of "[edu/standford/server4/bin/listdir" is split into:[edu (Gobal domain), /stanford/server4 (organization domain), and /bin/listdir (directory and file)

This naming scheme gives clients all the necessary information (using only the file name) to locate a file in a globally distributed file system. While this may seem like a good solution, there a few inherent limitations to the proposal.

First, file replication and load balancing can only be done at the lowest level (i.e., in the file server selected by the organization hosting the file). This can lead to a bottleneck when multiple files in the same organization become "hot". The authors suggest using caching and multicast to improve performance and avoid congestion on inter-organization links. Second, it requires all organizations participating in the system to agree or regulate the common namespace, much like the current Domain Name System (DNS). For this to work there must be an organization in which each stakeholder in the system is equally represented. While systems like these do exist currently (e.g., ICANN (The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization that represents regional registrars, the Internet Engineering Task Force (IETF), Internet users and providers to help keep the Internet secure, stable and inter-operable.)), they have large amounts of administrative overhead and therefore limit the speed at which changes to deployed implementations can take place.

One advantage of the approach of Cheriton et al. is that names and directory structures must only be unique within an organization/server. The system as a whole does not have to keep track of every organization-level implementation, yet different organizations should still be able to exchange data.

Metadata Servers

The Google File System (GFS) [3] takes a different approach to naming files. GFS assumes that all the clients communicate with a single master server, who keeps a table mapping full pathnames to metadata (file locks and location). The namespace is therefore centrally managed, and all clients must register file operations with the master before they can be performed. While this architecture has an obvious central point of failure (which can be addressed by replication), it has the advantage of not having to deal with a distributed namespace. This central design also has the advantage of improving data consistency across multi-level distribution nodes. It also allows data to be moved to optimal nodes to increase performance or distribute load. It's worth noting that lookup tables are a fundamentally different way to find contents in a directory as compared to UNIX inodes and related data structures. This approach has inherent limitations such as not being able to support symlinks .

Ceph [11] client nodes use near-POSIX file system interfaces which are relayed back to a central metadata cluster. The metadata cluster is responsible for managing the system-wide namespace, coordinating security and verifying consistency. Ceph decouples data from metadata which enables the system to also distribute metadata servers themselves. The metadata servers store pointers to "object-storage clusters" which hold the actual data portion of the file. The metadata servers also handle file read and write operations, which then redirect clients to the appropriate object storage cluster or device.

Locating Resources

Local File Systems

In some distributed systems, files are copied locally and replicated to remote servers in the background. NFS [9] is one example where clients mount the remote file system locally. The remote directory structure is mapped on to a local namespace which makes files transparently accessible to clients. In this scheme, there is no need for distributing indexes or metadata, since all files appear to be local. A client can find files on the "distributed" file system in the same way local files are found.

Metadata Servers

File systems that use lookup tables for storing the location and metadatada of files (e.g., [3,11]) can locate resources trivially by querying the lookup table. The table usually contains a pointer to either the file itself or a server hosting that file who can in turn handle the file operation request.

A very basic implementation of a metadata lookup is used in the Apollo Domain File System [6]. A central name server maps client-readable strings (e.g., "/home/dbarrera/file1" ) to UIDs. The name server can be distributed by replicating it a multiple locations, allowing clients to query the nearest server instead of a central one.

The Andrew file system [4] uses unique file identifiers to populate a location database on the central server which maps file identifiers to locations. The server is therefore responsible for forwarding file access requests to the correct client hosting that file.

Distributed Index Search

Systems like Freenet [2] by design want to make it difficult for unauthorized users to access restricted files. This is a difficult problem, since the system aims to be highly distributed, but at the same time provide guarantees that files won't be read or modified by unauthorized third-parties. However, Freenet has developed an interesting approach to locating files: when a file is requested from the network, a user must first obtain or calculate the file key. The user's node requests that file from neighboring nodes, who in turn check if the file is stored locally, and if not forward the request to the next nearest neighbor. If a node cannot forward a request any longer (because a loop would be created or all nodes have already been queried), then a failure message is transmitted back to the previous node. If a file is found at some point along the request path, then the file is sent back through all the intermediate nodes until it reaches the request originator, which allows these intermediate nodes to keep a copy of the file as a cache. The next time that file key is requested, a node which is closer might have it, which will increase the retrieval speed. Nodes "forget" about cached copies of files in a least recently used (LRU) manner, allowing the network to automatically balance load and use available space optimally.

Distributing a file index was proposed Plaxton et al. [8] as well. Their proposal however attempts have all nodes in the network maintain a virtual tree . The tree information is distributed such that each node knows about copies of files residing on itself and all nodes that form the subtree rooted at that node. All nodes are constantly being updated with neighbor information, meaning that new nodes slowly obtain tree information to become the roots of their subtrees. This method has the advantage of distributing load and providing a hierarchical search functionality that can use well known algorithms (BFS, DFS) to find resources on a network.

Pseudo-random Data Distribution

Ceph [11] distributes data through a method that maximizes bandwidth and efficiently uses storage resources. Ceph also avoids data imbalance (e.g., new devices are under-used) and load-asymmetries (e.g., often requested data placed on only new devices) with a globally known algorithm called CRUSH (Controlled Replication Under Scalable Hashing). By using a predefined number of placement groups (the smallest unit of object storage groups), the CRUSH algorithm stores and replicates data across the network in a pseudo-random way. This algorithm tells the metadata servers both where the data should be stored and where it can be found later, which helps clients and metadata servers in locating resources.

Conclusions

This paper has presented a brief survey of distributed file system research conducted over the past 20 years. A wide range of distributed file systems have been designed to have varying levels of scalability, usability and efficiency. Depending on the requirements of a distributed file system, different approaches may be taken to address two main concerns: file naming and file retrieval. Unfortunately there is no clear winner in either of these categories, which means that selecting the "right" method for a given file system will always depend on the requirements and users of that system.

References

[1] D. R. Cheriton and T. P. Mann. Decentralizing a global naming service for improved performance and fault tolerance. ACM Transactions on Computer Systems, 7:147–183, 1989.

[2] I. Clarke, O. Sandberg, B. Wiley, and T. Hong. Freenet: A distributed anonymous information storage and retrieval system. In Designing Privacy Enhancing Technologies, pages 46–66. Springer, 2001.

[3] S. Ghemawat, H. Gobioﬀ, and S. Leung. The Google ﬁle system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.

[4] J. Howard and C.-M. U. I. T. Center. An overview of the Andrew ﬁle system. Citeseer, 1988.

[5] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells, et al. Oceanstore: An architecture for global-scale persistent storage. ACM SIGARCH Computer Architecture News, 28(5):190–201, 2000.

[6] P. Levine. The Apollo DOMAIN Distributed File System. NATO ASI Series: Theory and Practice of Distributed Operating Systems, Y. Paker, JP. Banatre, M. Bozyi git, pages 241–260.

[7] D. Mazieres, M. Kaminsky, M. Kaashoek, and E. Witchel. Separating key management from ﬁle system security. ACM SIGOPS Operating Systems Review, 33(5):124–139, 1999.

[8] C. G. Plaxton, R. Rajaraman, A. W. Richa, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. pages 311–320, 1997.

[9] M. Satyanarayanan. A survey of distributed ﬁle systems. Annual Review of Computer Science, 4(1):73–104, 1990.

[10] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: a highly available file system for a distributed workstation environment. Computers, IEEE Transactions on, 39(4):447–459, Apr. 1990.

[11] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 307–320. USENIX Association, 2006.