DistOS 2015W Session 7

From Soma-notes
= Ceph =
Unlike GFS, which was discussed previously, Ceph is a general-purpose distributed file system. It follows the same general model of distribution as GFS and Amoeba.
== Main Components ==
* Client
* Cluster of Object Storage Devices (OSD)
** Stores data and metadata; clients communicate with OSDs directly to perform I/O operations
** Data is stored in objects (variable-size chunks)
 
* Meta-data Server (MDS)
** Manages files and directories. Clients interact with it to perform metadata operations such as open and rename. It also manages client capabilities.
** Clients '''do not''' need to access MDSs to find where data is stored, improving scalability (more on that below)
 
 
== Key Features ==
* Decoupled data and metadata
 
* Dynamic Distributed Metadata Management
 
** Metadata is distributed among multiple metadata servers using dynamic sub-tree partitioning: directories that are accessed more often get their metadata replicated to more servers, spreading the load and avoiding hot spots. This happens completely automatically
 
* Object-based storage
** Uses a cluster of OSDs to form a Reliable Autonomic Distributed Object-Store (RADOS), which handles failure detection and recovery for Ceph
 
* CRUSH (Controlled Replication Under Scalable Hashing)
** The hashing algorithm used to calculate the location of objects instead of looking them up
** This significantly reduces the load on the MDSs
** Responsible for automatically moving data when OSDs are added or removed (can be simplified as ''location = CRUSH(filename) % num_servers'')
** The CRUSH paper on Ceph's website can be [http://ceph.com/papers/weil-crush-sc06.pdf viewed here]
 
* RADOS (Reliable Autonomic Distributed Object-Store) is the object store for Ceph
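The CRUSH idea above can be illustrated with a toy placement function. This is ''not'' the real CRUSH algorithm (which walks a weighted hierarchy of failure domains); it is only a sketch of the ''location = hash(name) % num_servers'' simplification, with hypothetical names:

```python
import hashlib

def toy_crush(object_name: str, num_osds: int, replicas: int = 3) -> list[int]:
    """Toy stand-in for CRUSH: deterministically maps an object name to a
    set of OSD ids, so any client can compute placement with no lookup.
    (Real CRUSH uses placement rules over a cluster map; this is just the
    'location = hash(name) % num_servers' simplification.)"""
    # Hash the name to a stable integer, then pick a primary OSD.
    h = int.from_bytes(hashlib.sha256(object_name.encode()).digest()[:8], "big")
    primary = h % num_osds
    # Place replicas on the following OSDs (real CRUSH respects failure domains).
    return [(primary + i) % num_osds for i in range(replicas)]

# Every client computes the same answer independently -- no MDS query needed.
print(toy_crush("myfile.0001", num_osds=10))
```

Because placement is a pure function of the object name and cluster map, any client can locate data without asking an MDS; the cost is that data must migrate when the cluster changes, which real CRUSH is designed to minimize.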




= Chubby =
Chubby is a coarse-grained lock service, built and used internally by Google, that serves many clients with a small number of servers (a Chubby cell).
 
 
== System Components ==
* Chubby Cell
** Handles the actual locks
** Mainly consists of 5 servers known as replicas
** A consensus protocol ([https://en.wikipedia.org/wiki/Paxos_(computer_science) Paxos]) is used to elect the master from among the replicas; writes are propagated to a majority of replicas, while reads are handled by the master alone

* Client
** Used by programs to request and use locks
** Finds the master among the replicas and communicates with it via RPCs
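The fault-tolerance arithmetic behind a 5-replica cell can be sketched as quorum math. This only illustrates why writes go to a majority; it is not the actual Paxos implementation:

```python
def majority(n: int) -> int:
    """Smallest group size such that any two quorums of n replicas overlap."""
    return n // 2 + 1

# A 5-replica Chubby cell commits a write once a majority (3) acks it.
# Any later quorum must overlap the committed write, so the cell stays
# correct and available with up to 2 replicas down.
cell_size = 5
print(majority(cell_size))              # quorum size: 3
print(cell_size - majority(cell_size))  # failures tolerated: 2
```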
 
 
== Main Features ==
* Implemented as a semi-POSIX-compliant file system with a 256 KB file limit
** Permissions are for files only, not folders, thus breaking some compatibility
** Trivial to use by programs; just use the standard ''fopen()'' family of calls
 
* Uses a consensus algorithm among its set of servers to agree on which is the master in charge of the metadata
 
* Meant for locks that last hours or days, not seconds (thus, "coarse grained")
 
* A master server can handle tens of thousands of simultaneous connections
** This can be further improved by using caching servers, as most of the traffic consists of keep-alive messages
 
* Used by GFS to elect its master server
 
* Also used by Google as a nameserver
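As a rough illustration of the file-style locking model described above (the real Chubby client library is Google-internal, so the path and helper names here are hypothetical stand-ins), a coarse-grained master election can be sketched with exclusive file creation:

```python
import os

def try_acquire(path: str) -> bool:
    """Try to become master by creating the lock file exclusively.
    Mirrors Chubby's model: whoever holds the coarse-grained lock
    file is the elected master until it releases the lock."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        # O_EXCL makes creation atomic: exactly one contender succeeds.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())  # record who holds the lock
        os.close(fd)
        return True
    except FileExistsError:
        return False  # someone else is already master

def release(path: str) -> None:
    """Give up mastership by removing the lock file."""
    os.remove(path)
```

Whichever process creates the file first becomes master; the others fail and retry much later, matching the hours-to-days lock granularity above. Real Chubby additionally uses leases and keep-alives so a crashed master's lock eventually expires.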


== Issues ==
* Due to the use of Paxos (at the time, essentially the only proven algorithm for this problem), a Chubby cell is limited to only 5 servers. While this limits fault tolerance, in practice it is more than enough.
* Since it has a file-system client interface, many programmers tend to abuse the system and need education (even inside Google)

Revision as of 12:59, 28 February 2015
