Revision as of 19:23, 11 April 2015

Ceph

Unlike GFS that was talked about previously, this is a general purpose distributed file system. It follows the same general model of distribution as GFS and Amoeba.

HOW IT WORKS Ceph file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and filenames of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

BENEFITS The Ceph file system provides numerous benefits: It provides stronger data safety for mission-critical applications. It provides virtually unlimited storage to file systems. Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required! Ceph automatically balances the file system to deliver maximum performance.

Main Components

Client

Cluster of Object Storage Devices (OSD)
- It basically stores data and metadata and clients communicate directly with it to perform IO operations
- Data is stored in objects (variable size chunks)

Meta-data Server (MDS)
- It is used to manage the file and directories. Clients interact with it to perform metadata operations like open, rename. It manages the capabilities of a client.
- Clients do not need to access MDSs to find where data is stored, improving scalability (more on that below)

Key Features

Decoupled data and metadata

Dynamic Distributed Metadata Management

- It distributes the metadata among multiple metadata servers using dynamic sub-tree partitioning, meaning folders that get used more often get their meta-data replicated to more servers, spreading the load. This happens completely automatically

Object based storage
- Uses cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS) for Ceph failure detection and recovery

CRUSH (Controlled, Replicated, Under Scalable, Hashing)
- The hashing algorithm used to calculate the location of object instead of looking for them
- This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located.
- Responsible for automatically moving data when ODSs are added or removed (can be simplified as location = CRUSH(filename) % num_servers)
- The CRUSH paper on Ceph’s website can be viewed here

RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph

Chubby

It is a coarse grained lock service, made and used internally by Google, that serves multiple clients with small number of servers (chubby cell).

System Components

Chubby Cell
- Handles the actual locks
- Mainly consists of 5 servers known as replicas
- Consensus protocol (Paxos) is used to elect the master from replicas

Client
- Used by programs to request and use locks

Main Features

Implemented as a semi POSIX-compliant file-system with a 256KB file limit
- Permissions are for files only, not folders, thus breaking some compatibility
- Trivial to use by programs; just use the standard fopen() family of calls

Uses consensus algorithm (paxos) among a set of servers to agree on who is the master that is in charge of the metadata

Meant for locks that last hours or days, not seconds (thus, "coarse grained")

A master server can handle tens of thousands of simultaneous connections
- This can be further improved by using caching servers as most of the traffic are keep-alive messages

Used by GFS for electing master server

Also used by Google as a nameserver

Issues

Due to the use of Paxos (the only algorithm for this problem), the chubby cell is limited to only 5 servers. While this limits fault-tolerance, in practice this is more than enough.
Since it has a file-system client interface, many programmers tend to abuse the system and need education (even inside Google)
Clients need to constantly ping Chubby to verify that they still exist. This is to ensure that a server disappearing while holding a lock does not indefinitely hold that lock.
Clients must also consequently re-verify that they hold a lock that they think they hold because Chubby may have timed them out.