DistOS 2015W Session 7
Ceph
Unlike GFS, which was discussed previously, Ceph is a general-purpose distributed file system. It follows the same general model of distribution as GFS and Amoeba.
Main Components
- Client
- Cluster of Object Storage Devices (OSD)
- It stores both data and metadata; clients communicate directly with the OSDs to perform I/O operations
- Data is stored in objects (variable-size chunks)
- Meta-data Server (MDS)
- It manages files and directories. Clients interact with it to perform metadata operations such as open and rename. It also manages each client's capabilities.
- Clients do not need to contact the MDSs to find out where data is stored, which improves scalability (more on that below; see the sketch after this list)
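To make the division of labour concrete, here is a minimal Python sketch of a read under this model. It is not the real Ceph client API: `mds`, `osd_cluster`, and their methods (`open`, `locate`, `read`) are hypothetical stand-ins, and the fixed 4 MB object size is an assumption made for simplicity.

```python
# A minimal sketch, NOT the real Ceph client API. It only illustrates the
# division of labour: metadata operations go through the MDS, while file I/O
# goes directly to OSDs whose locations the client computes itself.

OBJECT_SIZE = 4 * 1024 * 1024  # assumed fixed-size 4 MB objects, for simplicity

def read_file(path, mds, osd_cluster):
    # 1. Metadata operation: the MDS translates the path into an inode and
    #    grants the client capabilities. It does NOT return a block list,
    #    because object locations are computed, never stored.
    inode, size, caps = mds.open(path, mode="r")

    # 2. Data operations: the client derives each object's name from the
    #    inode number and object index, computes which OSD holds it
    #    (see the CRUSH sketch below), and reads it directly from that OSD.
    chunks = []
    for index in range((size + OBJECT_SIZE - 1) // OBJECT_SIZE):
        object_name = f"{inode}.{index:08x}"
        osd = osd_cluster.locate(object_name)  # pure computation, no lookup
        chunks.append(osd.read(object_name))
    return b"".join(chunks)
```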
Key Features
- Decoupled data and metadata
- Dynamic Distributed Metadata Management
- Metadata is distributed among multiple metadata servers using dynamic sub-tree partitioning: directories that are accessed more often get their metadata replicated to more servers, spreading the load. This happens completely automatically
- Object based storage
- Uses the cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS), which handles failure detection and recovery for Ceph
- CRUSH (Controlled Replication Under Scalable Hashing)
- The hashing algorithm used to calculate the locations of objects instead of looking them up
- This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located.
- Responsible for automatically moving data when OSDs are added or removed (can be simplified as location = CRUSH(filename) % num_servers; see the sketch after this list)
- The CRUSH paper can be found on Ceph's website
- RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph
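To get a feel for computed placement, here is a toy Python sketch. It is not the actual CRUSH algorithm (real CRUSH walks a hierarchy of buckets and failure domains, and the function and parameter names below are invented for illustration); it only demonstrates the property described above: any client can compute an object's location from the object name plus a small shared cluster map, with no lookup traffic to the MDSs.

```python
# A toy stand-in for CRUSH, not the real algorithm: deterministic,
# lookup-free placement that every client can compute independently.

import hashlib

def _hash(*parts):
    """Deterministic 64-bit hash of the given parts."""
    digest = hashlib.sha256("/".join(str(p) for p in parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(object_name, osd_ids, pg_count=128, replicas=3):
    # Step 1: map the object to a placement group (a bucket of objects).
    pg = _hash(object_name) % pg_count

    # Step 2: map the placement group to an ordered list of OSDs using
    # highest-random-weight (rendezvous) hashing. Adding or removing an OSD
    # only remaps the placement groups that hash highest onto it, so only a
    # small fraction of the data has to move.
    ranked = sorted(osd_ids, key=lambda osd: _hash(pg, osd), reverse=True)
    return ranked[:replicas]  # primary first, then the replicas

# Every client computes the same answer without asking an MDS:
print(place("10000000123.00000000", osd_ids=range(8)))
```

This differs from the `% num_servers` simplification above in one useful way: a plain modulo reshuffles almost everything when the server count changes, whereas a rendezvous-style mapping moves only the objects that land on the added or removed OSD.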
Chubby
Chubby is a coarse-grained lock service, built and used internally by Google, that serves many clients with a small number of servers (a Chubby cell).
System Components
- Chubby Cell
- Handles the actual locks
- Typically consists of five servers, known as replicas
- A consensus protocol (Paxos) is used to elect a master from among the replicas
- Client
- Used by programs to request and use locks
Main Features
- Implemented as a semi-POSIX-compliant file-system with a 256 KB file-size limit
- Permissions apply to files only, not directories, which breaks some compatibility
- Trivial for programs to use; the API resembles the standard family of file open/read/write calls
- Uses a consensus algorithm among a set of servers to agree on which one is the master in charge of the metadata
- Meant for locks that are held for hours or days, not seconds (hence "coarse-grained")
- A master server can handle tens of thousands of simultaneous connections
- This can be further improved by using caching servers, since most of the traffic consists of keep-alive messages
- Used by GFS to elect its master server (see the sketch after this list)
- Also used by Google as a nameserver
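As an illustration of the GFS use case above, here is a minimal Python sketch of master election on top of a coarse-grained lock service. The `chubby` handle, its methods (`acquire`, `set_contents`, `get_contents`), and the lock-file path are hypothetical stand-ins, not the real Chubby client library.

```python
# A minimal sketch (hypothetical API, not the real Chubby client library) of
# how a service such as GFS can elect a master with a coarse-grained lock.

import time

def run_master_election(chubby, my_address):
    lock_path = "/ls/cell/gfs/master-lock"  # hypothetical lock file

    while True:
        # Every candidate tries to acquire the lock; exactly one succeeds.
        handle = chubby.acquire(lock_path, exclusive=True, block=False)
        if handle is not None:
            # Winner: advertise itself by writing its address into the file.
            # Clients reading this file is what lets Chubby double as a
            # name server.
            handle.set_contents(my_address)
            # Keep the handle: losing the lock (e.g. session expiry) means
            # the process is no longer the master.
            return handle

        # Loser: find out who won and wait before retrying. The lock is
        # expected to be held for hours or days, not seconds.
        master = chubby.get_contents(lock_path)
        print(f"current master is {master}, standing by")
        time.sleep(60)
```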
Issues
- Due to its use of Paxos (at the time, essentially the only well-established algorithm for this problem), a Chubby cell is limited to only five servers. While this limits fault tolerance, in practice this is more than enough.
- Since it has a file-system client interface, many programmers tend to abuse the system and need to be educated about its intended use (even inside Google)