Ceph

Unlike GFS that was talked about previously, this is a general purpose distributed file system (for various file size, from small to large). It follows the same general model of distribution as GFS and Amoeba.

HOW IT WORKS Ceph file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and filenames of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

BENEFITS The Ceph file system provides numerous benefits: It provides stronger data safety for mission-critical applications. It provides virtually unlimited storage to file systems. Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required! Ceph automatically balances the file system to deliver maximum performance.

Advantages of using File systems

It supports heterogeneous operating systems including all flavors of the unix operating system as well as Linux and windows
Multiple client machines can access a single resource simultaneously.

enables sharing common application binaries and read only information instead of putting them on each single machine. This results in reduced overall disk storage cost and administration overhead.

Gives access to uniform data to groups of users.
Useful when many users exists on many systems with each user's home directory located on every single machine. Network file systems allows you to all users home directories on a single machine under /home

Main Components

Client

Cluster of Object Storage Devices (OSD)
- It basically stores data and metadata and clients communicate directly with it to perform IO operations
- Data is stored in objects (variable size chunks)

Meta-data Server (MDS)
- It is used to manage the file and directories. Clients interact with it to perform metadata operations like open, rename. It manages the capabilities of a client.
- Clients do not need to access MDSs to find where data is stored, improving scalability (more on that below)

Key Features

Decoupled data and metadata
- good for scalability of various file size, from small to large

Dynamic Distributed Metadata Management

- It distributes the metadata among multiple metadata servers using dynamic sub-tree partitioning, meaning folders that get used more often get their meta-data replicated to more servers, spreading the load. This happens completely automatically

Object based storage
- Uses cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS) for Ceph failure detection and recovery

CRUSH (Controlled, Replicated, Under Scalable, Hashing)
- The hashing algorithm used to calculate the location of object instead of looking for them
- This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located.
- Responsible for automatically moving data when ODSs are added or removed (can be simplified as location = CRUSH(filename) % num_servers)
- The CRUSH paper on Ceph’s website can be viewed here

RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph

Chubby

It is a coarse grained lock service, made and used internally by Google, that serves multiple clients with small number of servers (chubby cell).

Chubby was designed to be a lock service, a distributed system that clients could connect to and share access to small files. The servers providing the system are partitioned into a variety of cells and access for a particular file is managed through one elected master node in one cell. This master makes all decisions and informs the rest of the cell nodes of that decision. If the master fails, the other nodes elect a new master. The problem of asynchronous consensus is solved through the use of timeouts as a failure detector. To avoid the scaling problem of a single bottleneck, the number of cells can be increased with the cost of making some cells smaller.

System Components

Chubby Cell
- Handles the actual locks
- Mainly consists of 5 servers known as replicas
- Consensus protocol (Paxos) is used to elect the master from replicas

Client
- Used by programs to request and use locks

Main Features

Implemented as a semi POSIX-compliant file-system with a 256KB file limit
- Permissions are for files only, not folders, thus breaking some compatibility
- Trivial to use by programs; just use the standard fopen() family of calls

Uses consensus algorithm (paxos) among a set of servers to agree on who is the master that is in charge of the metadata

Meant for locks that last hours or days, not seconds (thus, "coarse grained")

A master server can handle tens of thousands of simultaneous connections
- This can be further improved by using caching servers as most of the traffic are keep-alive messages

Used by GFS for electing master server

Also used by Google as a nameserver

Issues

Due to the use of Paxos (the only algorithm for this problem), the chubby cell is limited to only 5 servers. While this limits fault-tolerance, in practice this is more than enough.
Since it has a file-system client interface, many programmers tend to abuse the system and need education (even inside Google)
Clients need to constantly ping Chubby to verify that they still exist. This is to ensure that a server disappearing while holding a lock does not indefinitely hold that lock.
Clients must also consequently re-verify that they hold a lock that they think they hold because Chubby may have timed them out.