Latest revision as of 10:12, 20 April 2015

Ceph

Unlike GFS that was talked about previously, this is a general purpose distributed file system (for various file size, from small to large). It follows the same general model of distribution as GFS and Amoeba.

HOW IT WORKS Ceph file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and filenames of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

BENEFITS The Ceph file system provides numerous benefits: It provides stronger data safety for mission-critical applications. It provides virtually unlimited storage to file systems. Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required! Ceph automatically balances the file system to deliver maximum performance.

Advantages of using File systems

It supports heterogeneous operating systems including all flavors of the unix operating system as well as Linux and windows
Multiple client machines can access a single resource simultaneously.

enables sharing common application binaries and read only information instead of putting them on each single machine. This results in reduced overall disk storage cost and administration overhead.

Gives access to uniform data to groups of users.
Useful when many users exists on many systems with each user's home directory located on every single machine. Network file systems allows you to all users home directories on a single machine under /home

Main Components

Client

Cluster of Object Storage Devices (OSD)
- It basically stores data and metadata and clients communicate directly with it to perform IO operations
- Data is stored in objects (variable size chunks)

Meta-data Server (MDS)
- It is used to manage the file and directories. Clients interact with it to perform metadata operations like open, rename. It manages the capabilities of a client.
- Clients do not need to access MDSs to find where data is stored, improving scalability (more on that below)

Key Features

Decoupled data and metadata
- good for scalability of various file size, from small to large

Dynamic Distributed Metadata Management

- It distributes the metadata among multiple metadata servers using dynamic sub-tree partitioning, meaning folders that get used more often get their meta-data replicated to more servers, spreading the load. This happens completely automatically

Object based storage
- Uses cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS) for Ceph failure detection and recovery

CRUSH (Controlled, Replicated, Under Scalable, Hashing)
- The hashing algorithm used to calculate the location of object instead of looking for them
- This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located.
- Responsible for automatically moving data when ODSs are added or removed (can be simplified as location = CRUSH(filename) % num_servers)
- The CRUSH paper on Ceph’s website can be viewed here

RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph

Chubby

It is a coarse grained lock service, made and used internally by Google, that serves multiple clients with small number of servers (chubby cell).

Chubby was designed to be a lock service, a distributed system that clients could connect to and share access to small files. The servers providing the system are partitioned into a variety of cells and access for a particular file is managed through one elected master node in one cell. This master makes all decisions and informs the rest of the cell nodes of that decision. If the master fails, the other nodes elect a new master. The problem of asynchronous consensus is solved through the use of timeouts as a failure detector. To avoid the scaling problem of a single bottleneck, the number of cells can be increased with the cost of making some cells smaller.

System Components

Chubby Cell
- Handles the actual locks
- Mainly consists of 5 servers known as replicas
- Consensus protocol (Paxos) is used to elect the master from replicas

Client
- Used by programs to request and use locks

Main Features

Implemented as a semi POSIX-compliant file-system with a 256KB file limit
- Permissions are for files only, not folders, thus breaking some compatibility
- Trivial to use by programs; just use the standard fopen() family of calls

Uses consensus algorithm (paxos) among a set of servers to agree on who is the master that is in charge of the metadata

Meant for locks that last hours or days, not seconds (thus, "coarse grained")

A master server can handle tens of thousands of simultaneous connections
- This can be further improved by using caching servers as most of the traffic are keep-alive messages

Used by GFS for electing master server

Also used by Google as a nameserver

Issues

Due to the use of Paxos (the only algorithm for this problem), the chubby cell is limited to only 5 servers. While this limits fault-tolerance, in practice this is more than enough.
Since it has a file-system client interface, many programmers tend to abuse the system and need education (even inside Google)
Clients need to constantly ping Chubby to verify that they still exist. This is to ensure that a server disappearing while holding a lock does not indefinitely hold that lock.
Clients must also consequently re-verify that they hold a lock that they think they hold because Chubby may have timed them out.

@@ Line 1: / Line 1: @@
 = Ceph =
-* Key advantage is that it is a general purpose distributed file system.
+Unlike GFS that was talked about previously, this is a general purpose distributed file system (for various file size, from small to large). It follows the same general model of distribution as GFS and Amoeba.
-* System is composed of three units:
-	*Client,
-	*Cluster of Object Storage device (OSDs): It is basically stores data and metadata and clients communicate directly with it to perform IO operations.
-	*MetaData Server (MDS): It is used to manage the file and directories.Client basically interacts with it to perform metadata operations like open, rename. It manages the capabilities of a client.
-* system has three key features:
-     * decoupled data and metadata:
-     * Dynamic Distributed Metadata Management: It distribute the metadata among multiple metadata servers using dynamic subtree partitioning to increase the performance and avoid metadata access hot spots.
-     * Object based storage: Using cluster of OSDs to form a Reliable Autonomic Distributed Object-Store(RADOS) for ceph failure detection and recovery.
-*CRUSH (Controlled, Replicated, Under Scalable, Hashing) is the hashing algorithm used to calculate the location of object instead of looking for them. The CRUSH paper on Ceph’s website can be downloaded from here http://ceph.com/papers/weil-crush-sc06.pdf.
+HOW IT WORKS
-* RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph.
+Ceph file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and filenames of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.
+BENEFITS
+The Ceph file system provides numerous benefits:
+It provides stronger data safety for mission-critical applications.
+It provides virtually unlimited storage to file systems.
+Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required!
+Ceph automatically balances the file system to deliver maximum performance.
+== Advantages of using File systems ==
+*  It supports heterogeneous operating systems including all flavors of the unix operating system as well as Linux and windows
+* Multiple client machines can access a single resource simultaneously.
+enables sharing common application binaries and read only information instead of putting them on each single machine. This results in reduced overall disk storage cost and administration overhead.
+*Gives access to uniform data to groups of users.
+*Useful when many users exists on many systems with each user's home directory located on every single machine. Network file systems allows you to all users home directories on a single machine under /home
+== Main Components ==
+* Client
+* Cluster of Object Storage Devices (OSD)
+** It basically stores data and metadata and clients communicate directly with it to perform IO operations
+** Data is stored in objects (variable size chunks)
+* Meta-data Server (MDS)
+** It is used to manage the file and directories. Clients interact with it to perform metadata operations like open, rename. It manages the capabilities of a client.
+** Clients '''do not''' need to access MDSs to find where data is stored, improving scalability (more on that below)
+== Key Features ==
+* Decoupled data and metadata
+** good for scalability of various file size, from small to large
+* Dynamic Distributed Metadata Management
+** It distributes the metadata among multiple metadata servers using dynamic sub-tree partitioning, meaning folders that get used more often get their meta-data replicated to more servers, spreading the load. This happens completely automatically
+* Object based storage
+** Uses cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS) for Ceph failure detection and recovery
+* CRUSH (Controlled, Replicated, Under Scalable, Hashing)
+** The hashing algorithm used to calculate the location of object instead of looking for them
+** This significantly reduces the load on the MDSs because each client has enough information to independently determine where things should be located.
+** Responsible for automatically moving data when ODSs are added or removed (can be simplified as ''location = CRUSH(filename) % num_servers'')
+** The CRUSH paper on Ceph’s website can be [http://ceph.com/papers/weil-crush-sc06.pdf viewed here]
+* RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph
 = Chubby =
-It is basically a coarse grained lock service that serves multiple clients with small number of servers (chubby cell).
+It is a coarse grained lock service, made and used internally by Google, that serves multiple clients with small number of servers (chubby cell).
+Chubby was designed to be a lock service, a distributed system that clients could connect to and share access to small files. The servers providing the system are partitioned into a variety of cells and access for a particular file is managed through one elected master node in one cell. This master makes all decisions and informs the rest of the cell nodes of that decision. If the master fails, the other nodes elect a new master. The problem of asynchronous consensus is solved through the use of timeouts as a failure detector. To avoid the scaling problem of a single bottleneck, the number of cells can be increased with the cost of making some cells smaller.
+== System Components ==
+* Chubby Cell
+** Handles the actual locks
+** Mainly consists of 5 servers known as replicas
+** Consensus protocol ([https://en.wikipedia.org/wiki/Paxos_(computer_science) Paxos]) is used to elect the master from replicas
+* Client
+** Used by programs to request and use locks
+== Main Features ==
+* Implemented as a semi POSIX-compliant file-system with a 256KB file limit
+** Permissions are for files only, not folders, thus breaking some compatibility
+** Trivial to use by programs; just use the standard ''fopen()'' family of calls
+* Uses consensus algorithm (paxos) among a set of servers to agree on who is the master that is in charge of the metadata
+* Meant for locks that last hours or days, not seconds (thus, "coarse grained")
+* A master server can handle tens of thousands of simultaneous connections
+** This can be further improved by using caching servers as most of the traffic are keep-alive messages
+* Used by GFS for electing master server
-== system consists ==
+* Also used by Google as a nameserver
-*Chubby Cell: mainly consists of 5 servers known as replicas. Consensus protocol is used to elect the master from replicas.
-*Client: Find the master between the replicas. Consensus protocol is used to propagate the write request to the majority of servers. Read request is handled by master only.
-communication between client and server is via RPCs.
-* Is a consensus algorithm among a set of servers to agree on who is the master that is in charge of the metadata.
+== Issues ==
-* Can be considered a distributed file system for small size files only “256 KB” with very low scalability “5 servers”.
+* Due to the use of Paxos (the only algorithm for this problem), the chubby cell is limited to only 5 servers. While this limits fault-tolerance, in practice this is more than enough.
-* Is defined in the paper as “A lock service used within a loosely-coupled distributed system consisting of moderately large number of small machines connected by a high speed network”.
+* Since it has a file-system client interface, many programmers tend to abuse the system and need education (even inside Google)
+* Clients need to constantly ping Chubby to verify that they still exist. This is to ensure that a server disappearing while holding a lock does not indefinitely hold that lock.
+* Clients must also consequently re-verify that they hold a lock that they think they hold because Chubby may have timed them out.