DistOS 2015W Session 7
Ceph
Unlike GFS, which was discussed previously, Ceph is a general-purpose distributed file system. It follows the same general model of distribution as GFS and Amoeba.
Main Components
- Client
- Cluster of Object Storage Devices (OSDs)
  - Stores both data and metadata; clients communicate with the OSDs directly to perform I/O operations
  - Data is stored in objects (variable-size chunks)
- Metadata Server (MDS)
  - Manages files and directories; clients interact with it to perform metadata operations such as open and rename, and it manages the capabilities of a client
  - Clients do not need to ask an MDS where data is stored, which improves scalability (more on that below; a sketch of the resulting I/O path follows this list)
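
To make the division of labour concrete, here is a minimal sketch in Python of how a write could flow. All class and function names are invented for illustration; this is not the real Ceph API. The point is that the metadata operation goes to an MDS, while the data itself goes straight to an OSD whose location the client computes on its own.

    import hashlib

    class MDS:
        """Hypothetical metadata server: tracks names and sizes, never data."""
        def __init__(self):
            self.files = {}
        def open(self, path):
            self.files.setdefault(path, {"size": 0})   # metadata operation
            return path                                # stand-in for a capability

    class OSD:
        """Hypothetical object storage device: stores raw objects."""
        def __init__(self):
            self.objects = {}
        def write(self, obj_id, data):
            self.objects[obj_id] = data
        def read(self, obj_id):
            return self.objects[obj_id]

    def place(obj_id, num_osds):
        """Client-side placement: hash the object id, no lookup required."""
        return int(hashlib.sha1(obj_id.encode()).hexdigest(), 16) % num_osds

    mds = MDS()
    osds = [OSD() for _ in range(4)]

    cap = mds.open("/home/alice/notes.txt")                 # 1. metadata via the MDS
    obj_id = cap + ".0"                                     # 2. first object of the file
    osds[place(obj_id, len(osds))].write(obj_id, b"hello")  # 3. I/O directly to an OSD
    print(osds[place(obj_id, len(osds))].read(obj_id))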
 
 
Key Features
- Decoupled data and metadata
 
- Dynamic Distributed Metadata Management
  - Metadata is distributed among multiple metadata servers using dynamic sub-tree partitioning: directories that are used more often get their metadata replicated to more servers, spreading the load. This happens completely automatically (a toy sketch of the idea follows below).
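
The toy sketch below illustrates only the load-driven idea behind dynamic sub-tree partitioning, not Ceph's actual algorithm: each directory sub-tree has an owning MDS, the cluster counts recent operations per sub-tree, and a sub-tree that crosses a threshold is handed off to the least-loaded MDS. (Real Ceph can also replicate hot metadata rather than just moving it; all names and the threshold here are invented.)

    from collections import defaultdict

    HOT_THRESHOLD = 100   # assumed value, purely for illustration

    class MetadataCluster:
        def __init__(self, num_mds):
            self.num_mds = num_mds
            self.owner = {}                # sub-tree path -> MDS id
            self.load = defaultdict(int)   # sub-tree path -> recent op count

        def record_op(self, path):
            subtree = "/" + path.strip("/").split("/")[0]   # coarsest sub-tree
            self.owner.setdefault(subtree, hash(subtree) % self.num_mds)
            self.load[subtree] += 1
            if self.load[subtree] > HOT_THRESHOLD:
                self.rebalance(subtree)

        def rebalance(self, subtree):
            # Hand the hot sub-tree to the MDS carrying the least total load.
            totals = defaultdict(int)
            for st, mds in self.owner.items():
                totals[mds] += self.load[st]
            coolest = min(range(self.num_mds), key=lambda m: totals[m])
            self.owner[subtree] = coolest
            self.load[subtree] = 0          # start counting again after the move

    cluster = MetadataCluster(num_mds=3)
    for _ in range(150):
        cluster.record_op("/projects/report.txt")   # /projects becomes hot
    print(cluster.owner)                            # /projects moved to a cooler MDS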
 
- Object-based storage
  - Uses the cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS), which handles failure detection and recovery for Ceph
 
 
- CRUSH (Controlled Replication Under Scalable Hashing)
  - The hashing algorithm used to calculate the location of objects instead of looking them up
  - This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located
  - Responsible for automatically moving data when OSDs are added or removed (can be simplified as location = CRUSH(filename) % num_servers; a sketch of this simplification follows below)
  - The CRUSH paper on Ceph's website can be viewed at http://ceph.com/papers/weil-crush-sc06.pdf
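
The sketch below implements only the simplification quoted above (location = CRUSH(filename) % num_servers); it is plain hashing, not the real CRUSH algorithm. It shows why a client can locate an object without asking anyone, and why placements have to be recomputed, and data moved, when the number of OSDs changes.

    import hashlib

    def crush_simplified(filename, num_osds):
        """Stand-in for CRUSH: a deterministic hash of the name, modulo cluster size."""
        return int(hashlib.md5(filename.encode()).hexdigest(), 16) % num_osds

    files = ["a.txt", "b.txt", "c.txt", "d.txt"]

    before = {f: crush_simplified(f, 4) for f in files}   # cluster of 4 OSDs
    after  = {f: crush_simplified(f, 5) for f in files}   # one OSD added

    print("placements with 4 OSDs:", before)
    print("placements with 5 OSDs:", after)
    print("objects that must move:", [f for f in files if before[f] != after[f]])

With plain modulo hashing most placements change when a server is added; the real CRUSH is designed so that only a small, roughly proportional fraction of the objects has to move.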
 
 
- RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph
 
Chubby
It is a coarse-grained lock service, made and used internally by Google, that serves many clients with a small number of servers (a Chubby cell).
System Components
- Chubby Cell
  - Handles the actual locks
  - Typically consists of five servers known as replicas
  - A consensus protocol (Paxos) is used to elect the master from among the replicas (a simplified election sketch follows this list)
- Client
  - Used by programs to request and use locks
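
A real cell elects its master with Paxos, which is too long to reproduce here, so the sketch below shows only the shape of the outcome: a replica becomes master once a majority of the five replicas grant it their vote for the current term. This is a toy majority-vote election, not Paxos, and every name in it is invented.

    import random

    NUM_REPLICAS = 5

    class Replica:
        def __init__(self, rid):
            self.rid = rid
            self.voted_in_term = {}   # term -> candidate this replica voted for

        def request_vote(self, term, candidate):
            # Grant at most one vote per term; the first candidate to ask wins it.
            self.voted_in_term.setdefault(term, candidate)
            return self.voted_in_term[term] == candidate

    def elect(replicas, term, candidate):
        votes = sum(r.request_vote(term, candidate) for r in replicas)
        return votes > len(replicas) // 2   # needs a majority of the cell

    replicas = [Replica(i) for i in range(NUM_REPLICAS)]
    candidate = random.randrange(NUM_REPLICAS)
    if elect(replicas, term=1, candidate=candidate):
        print("replica", candidate, "is master for term 1")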
 
 
Main Features
- Implemented as a semi-POSIX-compliant file system with a 256 KB file-size limit
  - Permissions apply to files only, not folders, which breaks some compatibility
  - Trivial for programs to use; they just use the standard fopen() family of calls (a usage sketch follows below)
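
To show what "looks like a small file system" means in practice, here is a hypothetical client-side sketch: a lock is just a small file that a program opens, tries to acquire, and can write a little data into. The class, method names, and the /ls/... path are invented for illustration and are not Chubby's real API.

    class FakeChubbyClient:
        MAX_FILE_SIZE = 256 * 1024            # the 256 KB per-file limit

        def __init__(self):
            self.files = {}                   # path -> contents
            self.locks = {}                   # path -> current holder

        def open(self, path):
            self.files.setdefault(path, b"")
            return path                       # the "handle" is just the path here

        def acquire(self, handle, holder):
            if self.locks.get(handle) in (None, holder):
                self.locks[handle] = holder
                return True
            return False                      # someone else holds the lock

        def set_contents(self, handle, data):
            if len(data) > self.MAX_FILE_SIZE:
                raise ValueError("files are limited to 256 KB")
            self.files[handle] = data

    cell = FakeChubbyClient()
    h = cell.open("/ls/cell/service/master-lock")
    if cell.acquire(h, holder="worker-7"):
        cell.set_contents(h, b"worker-7:9090")   # the winner advertises its address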
 
 
- Uses a consensus algorithm among a set of servers to agree on which one is the master in charge of the metadata
 
- Meant for locks that last hours or days, not seconds (thus, "coarse grained")
 
- A master server can handle tens of thousands of simultaneous connections
  - This can be further improved by using caching servers, since most of the traffic consists of keep-alive messages (a proxy-style sketch follows below)
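
The sketch below illustrates why that helps: an intermediate server sits between many clients and the master, tracks their keep-alives itself, and forwards only a single keep-alive upstream, so the master sees one connection instead of thousands. The classes and method names are invented; this is only the idea, not Chubby's actual proxy or caching protocol.

    class Master:
        def __init__(self):
            self.keepalives_received = 0
        def keepalive(self):
            self.keepalives_received += 1

    class KeepAliveProxy:
        def __init__(self, master):
            self.master = master
            self.sessions = set()

        def client_keepalive(self, client_id):
            self.sessions.add(client_id)   # answered locally, not forwarded

        def flush(self):
            # One upstream keep-alive covers every session this proxy tracks.
            if self.sessions:
                self.master.keepalive()

    master = Master()
    proxy = KeepAliveProxy(master)
    for c in range(10_000):
        proxy.client_keepalive(c)          # 10,000 client keep-alives arrive
    proxy.flush()
    print(master.keepalives_received)      # the master handled just one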
 
 
- Used by GFS to elect its master server
 
- Also used by Google as a nameserver
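
Using it as a nameserver boils down to storing tiny name-to-address records in Chubby files that clients read to resolve a service. The sketch below stands in for that with a plain dictionary and invented paths; it is only meant to show how small the records and the interface are.

    records = {}   # stands in for Chubby files holding name records

    def publish(name, address):
        records["/ls/cell/names/" + name] = address.encode()

    def resolve(name):
        return records["/ls/cell/names/" + name].decode()

    publish("bigtable-master", "10.0.0.12:8000")
    print(resolve("bigtable-master"))   # clients just read the file to resolve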
 
Issues
- Due to the use of Paxos (the only algorithm for this problem), the Chubby cell is limited to only five servers. While this limits fault tolerance, in practice this is more than enough.
- Since it has a file-system client interface, many programmers tend to abuse the system and need to be educated (even inside Google)