Author: Omi Iyamu oiyamu@gmail.com

PDF available at [PDF]
=Abstract=
File sharing is a tool necessary for group collaboration, a simple way to make your files available to others, and a nice way to access file contents across multiple machines. This paper discusses, at a high level, the different file-sharing systems currently in use and the different strategies they employ to facilitate file sharing. Section 2 categorizes file-sharing systems by scale into Local Area Network sharing and Internet-based sharing. Section 3 discusses the steps involved in sharing an actual file using the systems described in section 2. Finally, section 4 discusses the challenges that must be overcome to develop an effective file-sharing system for a distributed operating system and gives some suggestions for how some of them may be addressed.




=1.0 Introduction=
File sharing in a distributed environment should differ from file sharing in a local environment. In this paper, any mention of a distributed operating system refers to an Internet-based operating system; the distributed environment under discussion is therefore the Internet. Likewise, any mention of a local environment refers to a local area network.

The scope of this paper is a review of a few file-sharing systems. The motivation is to determine what challenges must be addressed in the development of a file-sharing system that can be deployed on a distributed operating system.

Discussion is kept at a high level so that readers without a strong technical background can follow it, although some computer science or similar background is assumed.
=2.0 File Sharing Systems=
The main differences between file-sharing systems are the modes of access and the methods used to transfer the shared files. There are numerous types of file-sharing systems; this paper categorizes them into two types based on scale. Section 2.1 covers Local Area Network sharing, which can be considered small-scale file sharing. Section 2.2 covers Internet-based file sharing, which can be considered large-scale file sharing.


==2.1 Local Area Network Sharing==
 
On a Local Area Network (LAN), the computers have some degree of trust between them. The key advantages of sharing systems designed for Local Area Networks are the ability to set access restrictions on shared files and increased transfer speeds. Examples of such systems are AFP (Apple Filing Protocol), used by Apple, and SMB (Server Message Block), used by Windows.
 
==2.2 Internet Based File Sharing==
 
There are a number of Internet-based or online file-sharing systems that take different approaches to file sharing. Some examples are peer-2-peer networks, discussed in section 2.2.1, and FTP (File Transfer Protocol), discussed in section 2.2.2.
 
===2.2.1 Peer-2-peer Systems===
 
Peer-2-peer is one of the most commonly used file-sharing systems. User computers act as both client and server nodes and share content among themselves. There are two main styles in which peer-2-peer file-sharing systems work: one involves the use of torrents and the other does not.
 
* Torrent style
Of all the torrent-based peer-2-peer networks, Bit-torrent is the most commonly used today [1]. In itself, Bit-torrent is just a file downloading protocol that enables simultaneous downloading from different sources holding the exact same file.
 
* Non-torrent style
This is the older style of peer-2-peer network, such as Kazaa. Unlike torrent networks, there is a centralized server that holds information about who is sharing which files, and downloading is done from one single computer to another single computer.
 
===2.2.2 File Transfer Protocol===
 
FTP, as the name suggests, is a file transfer protocol. Transfer is made from a single source computer to a single receiving computer. FTP file systems are often password protected to ensure that only authorized users access the files. To access an FTP file system, you need to know the IP address or domain name of the computer you want to access. When a file is requested, the complete file is downloaded onto the requesting computer.
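As a rough illustration, the whole procedure described above — connect by address, authenticate, and pull down the complete file — can be sketched with Python's standard-library ftplib client. The server name and paths below are placeholders, not a real host:

```python
from ftplib import FTP  # Python's standard-library FTP client

def download_file(host, remote_path, local_path, user="anonymous", password=""):
    """Log in to an FTP server and download one complete file."""
    with FTP(host) as ftp:          # connect using an IP address or domain name
        ftp.login(user, password)   # many FTP servers require a password
        with open(local_path, "wb") as f:
            # RETR transfers the complete file to the requesting computer
            ftp.retrbinary("RETR " + remote_path, f.write)

# Hypothetical usage -- the server and paths are placeholders:
# download_file("ftp.example.org", "/pub/notes.txt", "notes.txt")
```

Note that, exactly as described, the client must know the server's address up front: FTP offers no search facility of its own.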
 
=3.0 File Sharing Process=
 
There are numerous file-sharing protocols available, and the sharing process can generally be broken into three main steps: sharing the file itself, finding the shared file, and accessing or transferring the shared file. This section discusses the process for peer-2-peer networks and Local Area Networks.


==3.1 Sharing the File==


Sharing the actual file is the process of setting up a file for sharing. Different file-sharing systems follow different processes for enabling a file to be shared.


===3.1.1 Peer-2-peer Sharing===


Peer-2-peer torrent networks generally follow a submission process for file sharing. With Bit-torrent, a user injects new content by uploading a torrent file to a torrent search website such as supernova.com and creating a seed with the first copy of the file [1]. Bit-torrent has a mediator system that checks the contents of files to make sure they are what they claim to be. When a user submits a new file, a mediator has to check it before it is allowed into the sharing network. After a user has submitted several files that passed mediation, he is promoted to unmediated submitter status, meaning the user is trusted enough to submit files that are injected directly into the sharing network without being mediated [1]. Non-torrent peer-2-peer networks do not follow this submission system; to share a file, all you usually have to do is place it in the share directory used by the third-party peer-2-peer application.


There is no notion of setting access restrictions in peer-2-peer file sharing. Users generally have unrestricted access to shared content; it can be downloaded, edited, and re-uploaded by anyone.


===3.1.2 Local Area Network Sharing===
In Local Area Networks, setting up a file to be shared does not involve any submission process or mediation. Because members of the network have some level of trust between them, all you have to do to set up a file for sharing is go into the file's properties and enable its sharing property. Access restrictions can also be set to restrict read and/or write access to the files or directories being shared.


* Read only
In this setting the user is only allowed to view the contents of the file; no changes can be made to the root file. The only way around this is to copy the file over and make changes to the local copy.


* Write only
This setting is used on directories. It turns a directory into a drop box: other users on the network can write files to the directory but cannot view its contents. Only the owner of the directory can read its contents.


* Read and Write
This setting allows a user to make changes to the file and save those changes to the root file, without copying the file over. For a directory, the contents of the directory can be modified remotely.
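As an illustrative sketch only (real AFP and SMB permissions are enforced by the file server, not by client code like this), the three access settings can be modeled as a small permission check:

```python
# Toy model of the three LAN share access settings described above.
# This is a sketch, not how AFP/SMB actually implement enforcement.

class SharedDirectory:
    def __init__(self, owner, mode):
        self.owner = owner
        self.mode = mode      # "read-only", "write-only", or "read-write"
        self.files = {}       # file name -> contents

    def read(self, user, name):
        # "Write only" turns the directory into a drop box: only the
        # owner may view its contents.
        if self.mode == "write-only" and user != self.owner:
            raise PermissionError("drop box: only the owner can read")
        return self.files[name]

    def write(self, user, name, data):
        # "Read only" users cannot change the root file; they must copy
        # the file and edit their local copy instead.
        if self.mode == "read-only" and user != self.owner:
            raise PermissionError("read-only share: copy the file to edit it")
        self.files[name] = data

dropbox = SharedDirectory(owner="alice", mode="write-only")
dropbox.write("bob", "report.txt", "draft")  # allowed: bob drops a file in
dropbox.read("alice", "report.txt")          # allowed: the owner can read it
```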


==3.2 Locating Shared Files==


People share files so that they or others can access them remotely. As such, finding a shared file is a key step in the sharing process. Methods of locating shared files differ between sharing systems.


===3.2.1 Peer-2-peer File Search===


In peer-2-peer systems, finding the shared files you want is fairly easy. Non-torrent networks like Kazaa have a centralized server that holds lists of who is sharing what [3]. A third-party peer-2-peer application is needed to search through this list. However, cleaning of the file lists on these systems is poor, which results in users sometimes downloading “fake” files.


In torrent networks like Bit-torrent, where shared files are checked on submission, the likelihood of downloading a fake file is reduced. However, searching for a shared file is done via third-party search engines like supernova.com and isohunt.com.


===3.2.2 Local Area Network File Search===


In Local Area Networks, you need to know where a shared file is located in order to find it. If you are looking for a particular file and do not know its location, you may have to comb through the entire network manually in search of it.
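A minimal sketch of this manual combing, assuming the remote shares are already mounted locally (the mount points like /mnt/server1 are hypothetical):

```python
import os

def find_file(share_roots, filename):
    """Walk every mounted network share, directory by directory, until a
    file with the given name is found; return its path, or None."""
    for root in share_roots:  # e.g., ["/mnt/server1", "/mnt/server2"]
        for dirpath, _dirnames, filenames in os.walk(root):
            if filename in filenames:
                return os.path.join(dirpath, filename)
    return None

# Hypothetical usage:
# find_file(["/mnt/server1", "/mnt/server2"], "budget.xls")
```

The point of the sketch is the cost: with no index, locating one file means visiting every directory on every share.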


==3.3 Transferring the File==


In order to access a file over any network, some level of transfer needs to be made, whether temporary or permanent. A file is transferred temporarily if it only needs to be viewed or edited; it is transferred permanently if it is being copied or moved completely. File-sharing systems like peer-2-peer only transfer files permanently, whereas most local file-sharing systems over a local area network make a permanent transfer only when a copy or cut command is executed.


===3.3.1 Peer-2-peer File Transfer===


After the user has identified the target file, there are two main ways the file can be transferred, depending on the type of peer-2-peer network.


* Single user to single user transfer
In this style of transfer, the complete file is downloaded from a single source. Non-torrent peer-2-peer networks use this style of transfer. Torrent networks only use this style when dealing with shared files that have a single seed.


* Multiple users to single user transfer
In this style of transfer, the file is downloaded simultaneously from multiple sources. This is the style used by torrent networks like Bit-torrent. Files shared on torrent networks are split into chunks, and the torrent file itself holds information about the seeds for the particular shared file. Different chunks of the shared file are downloaded simultaneously onto the user's computer and reassembled. This way, much higher download speeds can be achieved compared to single-to-single user transfers.
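The chunking, multi-source download, and reassembly described above can be sketched as a small simulation. The "peers" here are plain dictionaries standing in for network connections, and, as in real torrents, each chunk is verified against a hash recorded when the torrent was created:

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunks for illustration; real torrents use ~256 KB pieces

def make_torrent(data):
    """Simulate creating a torrent: record the hash of every chunk so a
    downloader can verify each piece independently of its source."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [hashlib.sha1(c).hexdigest() for c in chunks]

def download(peers, piece_hashes):
    """Fetch each chunk from a different peer (round-robin) and reassemble.
    Each peer is a dict mapping chunk index -> bytes."""
    pieces = []
    for i, expected in enumerate(piece_hashes):
        chunk = peers[i % len(peers)][i]  # a different source per chunk
        assert hashlib.sha1(chunk).hexdigest() == expected, "corrupt piece"
        pieces.append(chunk)
    return b"".join(pieces)

data = b"shared file contents"
hashes = make_torrent(data)
# Two seeds, each holding a full copy of the file, served chunk by chunk:
seed = {i: data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(len(hashes))}
assert download([seed, dict(seed)], hashes) == data
```

Because each chunk carries its own hash, a downloader can mix sources freely: any peer holding a valid chunk is as good as any other, which is what makes the simultaneous multi-source download safe.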


===3.3.2 Local Operating System File Transfer===


In a local area network setting, files are generally viewed from the root. Technically, the complete file or portions of it are transferred to main memory and viewed from there, the same way a local copy would be. The only difference is that instead of the transfer being made from local storage (the hard drive) to main memory, it is made from a remote storage device somewhere on the network to main memory. This is only practical because transfer speeds over a local network are faster than over the Internet. As such, access restrictions can be properly enforced.


=4.0 Sharing of Distributed Files=
 
When we think of file sharing, we generally think of the file being located on our own computer. With a distributed file system, the file we want to share will most likely not physically be on our computer. This adds a level of complexity to the actual sharing of the file.
 
Sharing a file in the case of a distributed operating system will have to be scalable enough to be deployed over the Internet. This means that traditional AFP and SMB approaches will have difficulty scaling up to the task. Examples of file-sharing systems that already work at this level, as discussed, are peer-2-peer networks and FTP. To define an effective file-sharing system for a distributed operating system, the following challenges need to be addressed.
 
* Transfer speed
When a file is transferred, it should be transferred at the highest speed possible. A torrent approach may not be a complete answer, as multiple copies of the file are needed to improve speed. This is a serious problem for sensitive files, where a user may not want multiple copies located all over the Internet.
 
* Duplicate files
As it is, common files such as music files may have millions of copies located on different computers all over the world. For a distributed file system, having so many copies of the same file is an ineffective use of space and should be avoided where possible.
 
* File integrity
Corrupted or fake files are an issue in sharing because they may end up harming the computers that access them. One way this is mitigated today is through reporting systems, in which users report a fake or corrupted file to the host or source. Another approach is automated checking systems that go through files verifying their integrity. In torrent systems, as previously discussed, mediators check files manually.
 
* File backup
This is a solution that helps with file integrity as well as data loss. If a file is determined to have lost its integrity, there needs to be a mechanism to restore it; the easiest way is to restore the file from a good backup. Data or file loss can happen in many ways, for instance if the server on which the file is stored goes down. In that case, a backup copy needs to be located somewhere else that the user can access.
 
* Access restrictions
File-sharing systems like FTP, AFP, and SMB can restrict a user's ability to access a particular file with authentication mechanisms. Having such capabilities in a distributed sharing environment is certainly necessary for more flexible and restricted sharing. AFP and SMB take access restrictions further by also restricting read and write capabilities.
 
* Search capability
This can be looked at as more of a convenience measure than a need; it would be nice for a user to be able to search through all the shared files that he or she has access to. Having this will certainly aid the development of more user-friendly distributed operating systems.
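As one hedged sketch of how several of these challenges interact: a content-addressed store, in which a file's identifier is the hash of its contents, would address both duplicate files (identical content is stored once) and file integrity (corruption is detectable on retrieval, signaling that a restore from backup is needed). The class below is purely illustrative, not an existing system:

```python
import hashlib

class ContentStore:
    """Illustrative content-addressed store: a file's key is the SHA-256
    hash of its contents, so identical files are stored only once and
    corruption is detectable when the file is read back."""

    def __init__(self):
        self.blobs = {}  # hash -> file contents

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data  # storing a duplicate is a no-op
        return key

    def get(self, key):
        data = self.blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            # Integrity lost: the right response is to restore from backup.
            raise ValueError("file failed integrity check")
        return data

store = ContentStore()
k1 = store.put(b"song.mp3 contents")
k2 = store.put(b"song.mp3 contents")       # the millionth copy of the song...
assert k1 == k2 and len(store.blobs) == 1  # ...is stored only once
```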


=5.0 Conclusion=


File sharing is necessary to accomplish many collaborative tasks, not only in the workplace but in other areas as well. We have discussed the differences between some of the popular file-sharing systems in use today, such as peer-2-peer networks and Local Area Network file sharing. The similarity between the two is that shared files are stored on the host computers; in a distributed environment this may not be the case. Through this study of current file-sharing systems, we have found that in order to develop an effective file-sharing system for a distributed operating system, challenges such as transfer speed, duplicate files, file integrity, file backup, access restrictions, and search capability need to be addressed. Current file-sharing systems address some of these issues, but no single one addresses all of them properly. As such, perhaps a hybrid between Local Area Network sharing and Internet-based file sharing is needed.


=References=


[1] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. The Bit-torrent P2P file-sharing system. Delft University of Technology, Delft, The Netherlands.


[6] P. Levine. The Apollo DOMAIN Distributed File System. NATO ASI Series: Theory and Practice of Distributed Operating Systems, Y. Paker, JP. Banatre, M. Bozyi git, pages 241–260.
[2] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In Inter- national Workshop on Peer to Peer Systems, Berkeley, CA, USA, February 2003.


[7] D. Mazieres, M. Kaminsky, M. Kaashoek, and E. Witchel. Separating key management from file system security. ACM SIGOPS Operating Systems Review, 33(5):124–139, 1999.
[3] B. Cohen. Incentives build robustness in bittorrent. In Workshop on Economics of Peer-to- Peer Systems, Berkeley, USA, May 2003.


[8] C. G. Plaxton, R. Rajaraman, A. W. Richa, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. pages 311–320, 1997.
[4] S. Saroiu, P. Krishna, G. Steven, D. Gribble. A Measurement Study of Peer-to-peer File Sharing Systems. University of Washington, Seattle, WA, USA.


[9] M. Satyanarayanan. A survey of distributed file systems. Annual Review of Computer Science, 4(1):73–104, 1990.
[5] N. Leibowitz, M. Ripeanu, and A. Wierzbicki. Deconstructing the kazaa network. In 3rd IEEE Workshop on Internet Applications (WIAPP’03), San Jose, CA, USA, June 2003.


[10] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: a highly available file system for a distributed workstation environment. Computers, IEEE Transactions on, 39(4):447–459, Apr. 1990.
[6] R. Sherwood, R. Braud, and B. Bhattacharjee. Slurpie: A cooperative bulk data transfer protocol. In IEEE Infocom, Honk Kong, China, March 2004.


[11] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 307–320. USENIX Association, 2006.
[7] B.T. Loo, J.M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor.UC Berkeley. VLDB Conference, Toronto, Canada, 2004.

Revision as of 14:11, 13 March 2011

Author: Omi Iyamu oiyamu@gmail.com

PDF available at [PDF]

Abstract

File sharing is a tool necessary for group collaboration, a simple way to make your files available to others, and a nice way to access file contents across multiple machines. This paper discusses, at a high level, the different file-sharing systems currently in use and the different strategies they employ to facilitate file sharing. In section 2, file sharing systems are categorized by scale into Local Area Network sharing and Internet based sharing. Section 3 discusses the steps involved in sharing an actual file using the systems introduced in section 2. Finally, section 4 discusses the challenges that need to be overcome to develop an effective file sharing system for a distributed operating system and suggests how some of them may be addressed.


Introduction

File sharing in a distributed environment should differ from that in a local environment. In this paper, any mention of a distributed operating system refers to an Internet based operating system; the distributed environment under discussion is therefore the Internet. Likewise, any mention of a local environment refers to a local area network.

The scope of this paper is limited to a review of a few file-sharing systems. The motivation is to determine what challenges need to be addressed in developing a file sharing system that can be deployed on a distributed operating system.

Discussions in this paper are kept at a high level so that readers without a strong technical background can follow them. However, some computer science or similar background is assumed.


File Sharing Systems

The main differences between file sharing systems are the modes of access and the methods used to transfer the shared files. There are numerous types of file sharing systems out there; I have categorized them into two types based on scale. Section 2.1 talks about Local Area Network sharing, which can be considered small-scale file sharing. Section 2.2 talks about Internet based file-sharing systems, which can be considered large-scale file sharing.

Local Area Network Sharing

On a Local Area Network (LAN), the computers present have some degree of trust between them. The key advantages of sharing systems designed for Local Area Networks are the ability to set access restrictions on shared files and increased transfer speeds. Examples are AFP (Apple Filing Protocol), used by Apple, and SMB (Server Message Block), used by Windows.

Internet Based File Sharing

There are a number of Internet based or online file sharing systems that take different approaches to file sharing. Some examples are peer-2-peer networks, discussed in section 2.2.1, and FTP (File Transfer Protocol), discussed in section 2.2.2.

Peer-2-peer Systems

Peer-2-peer is one of the most commonly used file sharing approaches out there. User computers act as both client and server nodes and share content among themselves. Peer-2-peer file-sharing systems work in two main styles: one involves the use of torrents and the other does not.

  • Torrent style

Out of all the torrent based peer-2-peer networks, Bit-torrent is the most commonly used today [1]. In itself, Bit-torrent is just a file downloading protocol that enables simultaneous downloading from different sources holding the exact same file.

  • Non-torrent style

This is the older style of peer-2-peer network, exemplified by Kazaa. Unlike torrent networks, there is a centralized server that holds information about who is sharing what files, and downloading is done from one single computer to another single computer.

File Transfer Protocol

FTP, as the name suggests, is a file transfer protocol. Transfer is made from a single source computer to a single receiving computer. FTP servers are often password protected to ensure that only authorized users access the files. To access an FTP server you need to know the IP address or domain name of the computer you want to reach. When a file is requested, the complete file is downloaded onto the requesting computer.
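The addressing requirements above can be illustrated with a standard FTP URL, which bundles the credentials, the host name, and the file path into one string. The sketch below uses Python's standard `urllib.parse` module; the host, user, and path are purely hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical FTP URL: "ftp.example.com", "alice", and the path are
# placeholders, not real resources.
url = urlparse("ftp://alice:secret@ftp.example.com/pub/report.pdf")

print(url.scheme)    # the protocol: ftp
print(url.hostname)  # the server you must know in advance
print(url.username)  # credentials for password-protected servers
print(url.path)      # the file to download in full
```

This makes concrete why FTP sharing is not discoverable: every component of the URL must be known to the requester beforehand.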

File Sharing Process

There are numerous file sharing protocols available, but the process can generally be broken into three main steps: sharing the file itself, finding the shared file, and accessing or transferring the shared file. In this section we discuss the process for peer-2-peer networks and Local Area Networks.

Sharing the file

Sharing the actual file is the process of setting a file up for sharing. Different file sharing systems follow different processes for enabling a file to be shared.

Peer-2-peer sharing

Peer-2-peer torrent networks generally follow a submission process for file sharing. With Bit-torrent, a user injects new content by uploading a torrent file to a torrent search website such as supernova.com and creating a seed with the first copy of the file [1]. Bit-torrent has a mediator system that checks the content of files to make sure they are what they claim to be. When a user submits a new file, a mediator has to check it before it is allowed into the sharing network. After a user has submitted several files that passed mediation, he is promoted to unmediated submitter status, meaning the user is trusted enough to submit files that are injected directly into the sharing network without mediation [1]. Non-torrent peer-2-peer networks do not follow this submission system; to share a file, you usually just place it in the share directory used by the third-party peer-2-peer application.
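The mediation workflow described above can be sketched as a small trust-promotion policy: submissions require a mediator's approval until the submitter has enough approved files, after which they are injected directly. This is a minimal illustrative sketch; the class name and the promotion threshold are assumptions, not part of the actual Bit-torrent moderation system described in [1].

```python
# Illustrative sketch of torrent-style submission mediation.
# PROMOTION_THRESHOLD is a made-up number for demonstration.
PROMOTION_THRESHOLD = 3  # approved submissions before unmediated status

class SubmissionQueue:
    def __init__(self):
        self.approved_count = {}  # user -> number of mediated approvals
        self.shared = []          # torrents accepted into the network

    def submit(self, user, torrent, mediator_ok=None):
        if self.approved_count.get(user, 0) >= PROMOTION_THRESHOLD:
            # Trusted submitter: injected directly, no mediation needed.
            self.shared.append(torrent)
            return "accepted"
        if mediator_ok:  # a mediator verified the content is genuine
            self.approved_count[user] = self.approved_count.get(user, 0) + 1
            self.shared.append(torrent)
            return "accepted"
        return "rejected"
```

For example, a user whose first three submissions pass mediation would afterwards have new torrents accepted without a mediator's check.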

There is no notion of setting access restrictions in peer-2-peer file sharing. Users generally have unrestricted access to shared content; files can be downloaded, edited, and re-uploaded by anyone.

Local Area Network sharing

In Local Area Networks, setting up a file to be shared does not involve any submission process or mediation. Because members of the network have some level of trust between them, all you have to do to set up a file for sharing is go into the file's properties and enable its sharing property. Access restrictions can also be set to restrict the read and/or write properties of the files or directories being shared.

  • Read only

In this setting the user is only allowed to view the contents of the file; no changes can be made to the root file. The only way around this is to copy the file over and make changes to your local copy.

  • Write only

This setting is used on directories and turns a directory into a drop box: another user on the network can write files to the directory but cannot view its contents. Only the owner of the directory can read its contents.

  • Read and Write

This setting allows the user to make changes to the file and save those changes to the root file, so the file does not need to be copied over. For a directory, its contents can be modified remotely.
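The three share modes above amount to a small permission table, with the owner retaining full access in every mode. The sketch below is illustrative only; the mode names mirror the text, while the table layout and the owner-override rule are assumptions about how a LAN sharing system might enforce them, not a description of AFP's or SMB's actual implementation.

```python
# Minimal sketch of the three LAN share modes described above.
MODES = {
    "read-only":  {"read": True,  "write": False},
    "write-only": {"read": False, "write": True},   # drop-box directories
    "read-write": {"read": True,  "write": True},
}

def allowed(mode, operation, is_owner=False):
    """Return whether `operation` ('read' or 'write') is permitted."""
    if is_owner:
        return True  # the owner always retains full access
    return MODES[mode][operation]
```

For instance, `allowed("write-only", "read")` is false for a network user but true for the directory's owner, matching the drop-box behaviour described above.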

Locating shared files

People share files so that they, or other people, may access them remotely. As such, finding a file that has been shared is a key step in the sharing process. Methods of locating shared files differ between sharing systems.

Peer-2-peer file search

In peer-2-peer systems, finding the shared files you want is fairly easy. Non-torrent networks like Kazaa have a centralized server that holds lists of who is sharing what [3]. In order to search through this list, a third-party peer-2-peer application is needed. However, cleaning of the file lists on these systems is poor, which results in users sometimes downloading "fake" files.

In torrent networks like Bit-torrent, where shared files are checked on submission, the likelihood of downloading a fake file is reduced. However, searching for a shared file is done via third-party search engines like supernova.com and isohunt.com.
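The centralized index that non-torrent networks rely on can be pictured as a simple map from file names to the peers holding them, with search reduced to a lookup. This is a toy sketch of the idea, not Kazaa's actual protocol; the peer addresses and file names are placeholders.

```python
# Toy sketch of a centralized search index for a non-torrent
# peer-2-peer network: the server records who shares what.
index = {}  # filename -> set of peer addresses (placeholders)

def register(peer, filenames):
    """A peer announces the files it is sharing."""
    for name in filenames:
        index.setdefault(name, set()).add(peer)

def search(name):
    """Return every known peer sharing `name`, in sorted order."""
    return sorted(index.get(name, set()))

register("10.0.0.5", ["song.mp3", "talk.pdf"])
register("10.0.0.9", ["song.mp3"])
```

Note that nothing in this scheme validates the registered entries, which is exactly why such networks accumulate "fake" files: the index trusts whatever peers announce.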

Local Area Network file search

In local area networks, in order to find shared files you need to know where the file is located. If you are looking for a particular file and do not know its location, you may have to comb through the entire network manually in search of it.

Transferring the file

In order to access a file over any network, some level of transfer needs to take place, whether temporary or permanent. Files are transferred temporarily if they only need to be viewed or edited; they are transferred permanently if they are being copied or moved completely. File sharing systems like peer-2-peer only transfer files permanently, whereas most file sharing systems over a local area network only make a permanent transfer when a copy or cut command is executed.

Peer-2-peer file transfer

After the user has identified his target file, there are, depending on the type of the peer-2-peer network, two main ways the file can be transferred to him.

  • Single user to single user transfer

In this style of transfer, the complete file is downloaded from a single source. Non-torrent peer-2-peer networks use this style. Torrent networks only use it when dealing with shared files that have a single seed.

  • Multiple users to single user transfer

In this style of transfer, the file is simultaneously downloaded from multiple sources. This is the style used by torrent networks like Bit-torrent. Files shared on torrent networks are split into chunks, and the torrent file itself holds information about the seeds for the particular shared file. Different chunks of the shared file are downloaded simultaneously onto the user's computer and reassembled. This way, much higher download speeds can be achieved than with single-to-single user transfers.
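The split-and-reassemble step above can be sketched in a few lines: the file is cut into fixed-size chunks, each chunk may arrive from a different seed in any order, and the downloader reorders them by index. The chunk size and the data are purely illustrative; real torrent chunks are far larger and are verified against per-chunk hashes.

```python
import random

# Tiny chunk size purely for demonstration; real systems use
# chunks on the order of hundreds of kilobytes.
CHUNK_SIZE = 4

def split(data):
    """Cut the shared file into numbered fixed-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def reassemble(numbered_chunks):
    """Rebuild the file from (index, chunk) pairs arriving in any order."""
    return b"".join(chunk for _, chunk in sorted(numbered_chunks))

chunks = list(enumerate(split(b"hello distributed world")))
random.shuffle(chunks)  # simulate out-of-order arrival from many seeds
assert reassemble(chunks) == b"hello distributed world"
```

Because each chunk can come from a different peer, the aggregate download speed is bounded by the sum of the seeds' upload capacities rather than by any single peer.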

Local operating system file transfer

In a local area network setting, files are generally viewed from the root. Technically, the complete file, or portions of it, is transferred to main memory and viewed from there, the same way it would be if you had a local copy. The only difference is that instead of the transfer being made from your local storage (hard drive) to main memory, it is made from a remote storage device somewhere on the network to main memory. The only real reason this can be done is that transfer speeds over a local network are faster than over the Internet. As such, access restrictions can be properly enforced.

Sharing of Distributed Files

When we think of file sharing, we generally think of the file being located on our computer. With a distributed file system, the file we want to share will most likely not physically be on our computer. This adds a level of complexity to the actual sharing of the file.

Sharing a file in a distributed operating system will have to be scalable enough to be deployed over the Internet. This means that traditional AFP and SMB approaches will have difficulty scaling up to the task. Examples of file sharing systems that already work at this scale, as discussed, are peer-2-peer networks and FTP. To define an effective file sharing system for a distributed operating system, the following challenges need to be addressed.

  • Transfer speed

When a file is transferred, it should be transferred at the highest speed possible. A torrent approach may not be a complete answer, as multiple copies of the file are needed to improve speed. This is a huge problem for sensitive files, where a user may not want multiple copies located all over the Internet.

  • Duplicate files

As it is, common files like music files may have millions of copies located on different computers all over the world. For a distributed file system, having so many copies of the same file is an ineffective use of space and should be avoided where possible.

  • File integrity

Corrupted or fake files are an issue in sharing because they may end up corrupting the computers that access them. One way this is mitigated today is through reporting systems in which users can report a fake or corrupted file to the host or source. Another approach is automated checking systems that go through files verifying their integrity. In torrent systems, as previously discussed, mediators do the checking of files manually.
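One common form the automated checking mentioned above takes is comparing a file's cryptographic digest against the digest published when the file was shared; any tampering or corruption changes the digest. The sketch below uses Python's standard `hashlib`; the sample data is illustrative.

```python
import hashlib

def checksum(data):
    """SHA-256 digest of a file's contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def verify(data, expected_digest):
    """True if the file still matches the digest published at share time."""
    return checksum(data) == expected_digest

original = b"important shared document"
published = checksum(original)      # recorded when the file was shared
assert verify(original, published)          # intact copy passes
assert not verify(b"tampered copy", published)  # altered copy fails
```

This is the same principle behind per-chunk hashes in torrent files: verification can be automated, removing the need for a human mediator to inspect content.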

  • File backup

File backup helps with both file integrity and data loss. If it is determined that a file has lost its integrity, there needs to be a mechanism to restore it; the easiest way is to restore the file from a good backup. Data or file loss can happen in many ways, for instance when the server on which the file is stored goes down. In that case, a backup copy needs to be located somewhere else that the user can access.
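Combining backups with the integrity checks above gives a simple restoration rule: scan the available replicas and take the first one whose digest matches the known-good checksum. This is an illustrative sketch under the assumption that the system keeps a trusted digest per file; real distributed stores layer replication placement and repair on top of this idea.

```python
import hashlib

def restore(replicas, good_digest):
    """Return the first backup copy matching the known-good digest,
    or None if every replica is corrupted or missing."""
    for copy in replicas:
        if hashlib.sha256(copy).hexdigest() == good_digest:
            return copy
    return None

good = b"quarterly report"
digest = hashlib.sha256(good).hexdigest()
# A corrupted primary is skipped; the intact backup is returned.
assert restore([b"corrupted!!", good], digest) == good
```

If `restore` returns None, the file is genuinely lost, which is why replicas should be spread across failure domains (different servers or sites) rather than kept alongside the original.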

  • Access restrictions

File sharing systems like FTP, AFP, and SMB can restrict a user's ability to access a particular file with authentication mechanisms. Having such capabilities in a distributed environment is certainly necessary for more flexible and restricted sharing. AFP and SMB take access restrictions further by also restricting read and write capabilities.

  • Search capability

This can be looked at as more of a convenience than a need; it would be nice for a user to be able to search through all the shared files that he or she has access to. Having this will certainly aid the development of more user friendly distributed operating systems.

Conclusion

File sharing is necessary to accomplish many collaborative tasks, not only in the workplace but in other areas as well. We have discussed the differences between some of the popular file sharing systems in use today, such as peer-2-peer networks and Local Area Network file sharing. The similarity between them is that the shared files are stored on the host computers; in a distributed environment this may not be the case. Through the study of current file sharing systems, we have found that in order to develop an effective file sharing system for a distributed operating system, challenges such as transfer speeds, duplicate files, file integrity, file backup, access restrictions, and search capabilities need to be addressed. Current file sharing systems address some of these issues, but no single one addresses all of them properly. As such, a hybrid between Local Area Network sharing and Internet based file sharing may be needed.

References

[1] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. The Bit-torrent P2P file-sharing system. Delft University of Technology, Delft, The Netherlands.

[2] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In International Workshop on Peer to Peer Systems, Berkeley, CA, USA, February 2003.

[3] B. Cohen. Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, USA, May 2003.

[4] S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. University of Washington, Seattle, WA, USA.

[5] N. Leibowitz, M. Ripeanu, and A. Wierzbicki. Deconstructing the Kazaa network. In 3rd IEEE Workshop on Internet Applications (WIAPP'03), San Jose, CA, USA, June 2003.

[6] R. Sherwood, R. Braud, and B. Bhattacharjee. Slurpie: A cooperative bulk data transfer protocol. In IEEE Infocom, Hong Kong, China, March 2004.

[7] B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, and I. Stoica. Enhancing P2P file-sharing with an Internet-scale query processor. In VLDB Conference, Toronto, Canada, 2004.