DistOS-2011W Distributed File Sharing
Author: Omi Iyamu oiyamu@gmail.com
PDF available at [PDF]
Abstract
File sharing is a tool necessary for group collaboration, a simple way to make your files available to others, and nice way to access file contents across multiple machines. This paper discusses on a high-level the different file-sharing systems currently being used and the different strategies they employ to facilitate file sharing. In section 2, different file sharing systems are categorized based on scale into Local Area Network sharing and Internet based sharing. Section 3 discusses the steps involved in the process of sharing an actual file using the different file sharing systems discussed previously in section 2. Finally in section 4, this paper discusses the challenges that need to be overcome to develop an effective file sharing system for a distributed operating system and gives some suggestions to how some of them may be overcome.
1.0 Introduction
File sharing in a distributed environment should differ from that in a local environment. In this paper, whenever a mention of a distributed operating system is made, it will be done so with reference to an Internet based operating system. As such, the distributed environment that will be talked about will be the Internet. Whenever a local environment is mentioned, it will be done so with reference to a local area network.
The scope of this paper is just a review of a few file-sharing systems. The motivation is to determine what challenges need to be addressed in the development of a file sharing system that can be deployed on a distributed operating system.
Discussions in this paper will be on a high level in order to enable readers that do not have strong technical background ease of understanding. However, a small level of computer science or similar background is needed.
2.0 File Sharing systems
The main differences between different file sharing systems are the modes of access and the methods used to transfer the shared files. There are numerous types of file sharing systems out there; I have categorized them into two types based on scale. Section 2.1 talks about Local Area Network sharing, which can be considered as a small-scale file sharing system. Section 2.2 talks about Internet based file-sharing systems, which can be considered large scale file sharing.
==
Local Naming
The Sun Network File System (NFS) specifies that each client sees a UNIX file namespace with a private root. Due to each client being free to manage its own namespace, several workstations mounting the same remote directory might not have the same view of the files contained in that directory. However, if file-sharing or location transparency is required, it can be achieved by convention (e.g., users agreeing on calling a file a specific name) rather than by design.
One of the first distributed file systems, the Apollo DOMAIN File System [6] uses 64-bit unique identifiers (UIDs) for every object in the system. Each Apollo client also has a UID created the time of its manufacture. When a new file is created, the UID for that file is derived from the time and UID of the file's workstation (this guarantees uniqueness of UIDs per fil e without a central server assigning them).
The Andrew file system [4] uses an internal 96-bit identifier for uniquely identifying files. These identifiers are used in the background to refer to files, but are never shown to users. Andrew clients see a partitioned namespace comprised of a local and shared namespace. The shared namespace is identical on all workstations, managed by a central server which can be replicated. The local namespace is typically only used for files required to boot an Andrew client, and to initialize the distributed client operation.
Cryptographic Naming
OceanStore [5] stores objects at the lowest level by identifying them with a globally unique identifier (GUID). GUIDs are convenient in distributed systems because they do not require a central authority to give them out. This allows any client on the system to autonomously generate a valid GUID with low probability of collisions (GUIDs are typically long bit strings e.g., more than 128 bits). At the same time, the benefit of an autonomous, de-centralized namespace management allows for malicious clients to hijack someone else's namespace and intentionally create collisions. To address this issue, OceanStore uses a technique proposed by Mazieres et al. [7] called self-certifying path names .
Self-certifying pathnames have all the benefits of public key cryptography without the burden of key management, which is known to be difficult, especially at a very large scale. One of the design goals of self-certifying pathnames is for clients to cryptographically verify the contents of any file on the network, without requiring exernal information. The novelty of this approach is that file names inherently contain all information necessary to communicate with remote servers. Essentially, an object's GUID is the secure hash (SHA-1 or similar) of the object's owner's key and some human readable name. By embedding a client key into the GUID, servers and other clients can verify the identity and ownership of an object without querying a third-party server.
Freenet [2] also uses keypair-based naming but in a slightly different way than OceanStore. Freenet identifies all files by a binary key which is obtained by applying a hash function. There are three types of keys in this distributed file system:
Keyword-signed key (KSK) This is the simplest identifier because it is derived from an arbitrary text string chosen by the user who is storing the file on the network. A user storing a PDF document might use the text string "freenet/distributed/file/system" to describe the file. The string is used to deterministically generate a private/public keypair. The public part of the key is hashed and becomes the file identifier.
We note that files can be recovered by guessing or bruteforcing the text string. Also, nothing stops two different users from coming up with the same descriptive string, and the second user's file would be rejected by the system, as there would be a collision in the namespace.
Signed-subspace key (SSK) This method enables personal namespaces for users. For this to work, users generate a public/private keypair using a good random number generator. The user also creates a descriptive text string, but in this case, it is XORed with the public key to generate the file key. This method allows users to manage their own namespace (i.e., collisions can still occur locally if the user picks the same string for two files). Users can also publish a list of keywords and a public key if they want to make those files publicly available.
Content-hash key (CHK) In this method, the file key is derived by hashing the contents of file. Files are also encrypted with a random encryption key specific to that file. For others to retrieve the file, the owner makes available the file hash along with the decryption key.
Hierarchical naming
Cheriton et al. [1] suggest naming objects using a long name which includes multiple pieces of information: (1) the resource's name and location on the file server where it resides; (2) the organization where that file server is located; and (3) a global administrative domain representing all the organizations participating the distributed file system. For example a file name of "[edu/standford/server4/bin/listdir" is split into:[edu (Gobal domain), /stanford/server4 (organization domain), and /bin/listdir (directory and file)
This naming scheme gives clients all the necessary information (using only the file name) to locate a file in a globally distributed file system. While this may seem like a good solution, there a few inherent limitations to the proposal.
First, file replication and load balancing can only be done at the lowest level (i.e., in the file server selected by the organization hosting the file). This can lead to a bottleneck when multiple files in the same organization become "hot". The authors suggest using caching and multicast to improve performance and avoid congestion on inter-organization links. Second, it requires all organizations participating in the system to agree or regulate the common namespace, much like the current Domain Name System (DNS). For this to work there must be an organization in which each stakeholder in the system is equally represented. While systems like these do exist currently (e.g., ICANN (The Internet Corporation for Assigned Names and Numbers (ICANN) is a non-profit organization that represents regional registrars, the Internet Engineering Task Force (IETF), Internet users and providers to help keep the Internet secure, stable and inter-operable.)), they have large amounts of administrative overhead and therefore limit the speed at which changes to deployed implementations can take place.
One advantage of the approach of Cheriton et al. is that names and directory structures must only be unique within an organization/server. The system as a whole does not have to keep track of every organization-level implementation, yet different organizations should still be able to exchange data.
Metadata Servers
The Google File System (GFS) [3] takes a different approach to naming files. GFS assumes that all the clients communicate with a single master server, who keeps a table mapping full pathnames to metadata (file locks and location). The namespace is therefore centrally managed, and all clients must register file operations with the master before they can be performed. While this architecture has an obvious central point of failure (which can be addressed by replication), it has the advantage of not having to deal with a distributed namespace. This central design also has the advantage of improving data consistency across multi-level distribution nodes. It also allows data to be moved to optimal nodes to increase performance or distribute load. It's worth noting that lookup tables are a fundamentally different way to find contents in a directory as compared to UNIX inodes and related data structures. This approach has inherent limitations such as not being able to support symlinks .
Ceph [11] client nodes use near-POSIX file system interfaces which are relayed back to a central metadata cluster. The metadata cluster is responsible for managing the system-wide namespace, coordinating security and verifying consistency. Ceph decouples data from metadata which enables the system to also distribute metadata servers themselves. The metadata servers store pointers to "object-storage clusters" which hold the actual data portion of the file. The metadata servers also handle file read and write operations, which then redirect clients to the appropriate object storage cluster or device.
Locating Resources
Local File Systems
In some distributed systems, files are copied locally and replicated to remote servers in the background. NFS [9] is one example where clients mount the remote file system locally. The remote directory structure is mapped on to a local namespace which makes files transparently accessible to clients. In this scheme, there is no need for distributing indexes or metadata, since all files appear to be local. A client can find files on the "distributed" file system in the same way local files are found.
Metadata Servers
File systems that use lookup tables for storing the location and metadatada of files (e.g., [3,11]) can locate resources trivially by querying the lookup table. The table usually contains a pointer to either the file itself or a server hosting that file who can in turn handle the file operation request.
A very basic implementation of a metadata lookup is used in the Apollo Domain File System [6]. A central name server maps client-readable strings (e.g., "/home/dbarrera/file1" ) to UIDs. The name server can be distributed by replicating it a multiple locations, allowing clients to query the nearest server instead of a central one.
The Andrew file system [4] uses unique file identifiers to populate a location database on the central server which maps file identifiers to locations. The server is therefore responsible for forwarding file access requests to the correct client hosting that file.
Distributed Index Search
Systems like Freenet [2] by design want to make it difficult for unauthorized users to access restricted files. This is a difficult problem, since the system aims to be highly distributed, but at the same time provide guarantees that files won't be read or modified by unauthorized third-parties. However, Freenet has developed an interesting approach to locating files: when a file is requested from the network, a user must first obtain or calculate the file key. The user's node requests that file from neighboring nodes, who in turn check if the file is stored locally, and if not forward the request to the next nearest neighbor. If a node cannot forward a request any longer (because a loop would be created or all nodes have already been queried), then a failure message is transmitted back to the previous node. If a file is found at some point along the request path, then the file is sent back through all the intermediate nodes until it reaches the request originator, which allows these intermediate nodes to keep a copy of the file as a cache. The next time that file key is requested, a node which is closer might have it, which will increase the retrieval speed. Nodes "forget" about cached copies of files in a least recently used (LRU) manner, allowing the network to automatically balance load and use available space optimally.
Distributing a file index was proposed Plaxton et al. [8] as well. Their proposal however attempts have all nodes in the network maintain a virtual tree . The tree information is distributed such that each node knows about copies of files residing on itself and all nodes that form the subtree rooted at that node. All nodes are constantly being updated with neighbor information, meaning that new nodes slowly obtain tree information to become the roots of their subtrees. This method has the advantage of distributing load and providing a hierarchical search functionality that can use well known algorithms (BFS, DFS) to find resources on a network.
Pseudo-random Data Distribution
Ceph [11] distributes data through a method that maximizes bandwidth and efficiently uses storage resources. Ceph also avoids data imbalance (e.g., new devices are under-used) and load-asymmetries (e.g., often requested data placed on only new devices) with a globally known algorithm called CRUSH (Controlled Replication Under Scalable Hashing). By using a predefined number of placement groups (the smallest unit of object storage groups), the CRUSH algorithm stores and replicates data across the network in a pseudo-random way. This algorithm tells the metadata servers both where the data should be stored and where it can be found later, which helps clients and metadata servers in locating resources.
Conclusions
This paper has presented a brief survey of distributed file system research conducted over the past 20 years. A wide range of distributed file systems have been designed to have varying levels of scalability, usability and efficiency. Depending on the requirements of a distributed file system, different approaches may be taken to address two main concerns: file naming and file retrieval. Unfortunately there is no clear winner in either of these categories, which means that selecting the "right" method for a given file system will always depend on the requirements and users of that system.
References
[1] D. R. Cheriton and T. P. Mann. Decentralizing a global naming service for improved performance and fault tolerance. ACM Transactions on Computer Systems, 7:147–183, 1989.
[2] I. Clarke, O. Sandberg, B. Wiley, and T. Hong. Freenet: A distributed anonymous information storage and retrieval system. In Designing Privacy Enhancing Technologies, pages 46–66. Springer, 2001.
[3] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.
[4] J. Howard and C.-M. U. I. T. Center. An overview of the Andrew file system. Citeseer, 1988.
[5] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells, et al. Oceanstore: An architecture for global-scale persistent storage. ACM SIGARCH Computer Architecture News, 28(5):190–201, 2000.
[6] P. Levine. The Apollo DOMAIN Distributed File System. NATO ASI Series: Theory and Practice of Distributed Operating Systems, Y. Paker, JP. Banatre, M. Bozyi git, pages 241–260.
[7] D. Mazieres, M. Kaminsky, M. Kaashoek, and E. Witchel. Separating key management from file system security. ACM SIGOPS Operating Systems Review, 33(5):124–139, 1999.
[8] C. G. Plaxton, R. Rajaraman, A. W. Richa, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. pages 311–320, 1997.
[9] M. Satyanarayanan. A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104, 1990.
[10] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: a highly available file system for a distributed workstation environment. Computers, IEEE Transactions on, 39(4):447–459, Apr. 1990.
[11] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 307–320. USENIX Association, 2006.