DistOS 2014W Lecture 10
==Context==
==GFS==
* Very different because of the workload it is designed for.
** Because of the number of small files that have to be indexed for the web, etc., it is no longer practical for the filesystem to store them individually; there is too much overhead. The problem is punted to userspace, including record delimitation (a toy sketch of userspace record framing follows this list).
* Doesn't care about latency, which is surprising considering it's Google, the company that pushed to change the TCP initial-window (IW) standard recommendations for the sake of latency.
* Mostly scanning sequentially through entire files.
* The paper is from 2003 and mentions still using 100BASE-T links.
* Data-heavy, metadata-light: contacting the metadata server is a rare event.
* Really good that they designed for unreliable hardware:
** All the replication
** Data checksumming
* Performance degrades for small random-access workloads; use another filesystem for those.
* The path of least resistance to scale, not an attempt to do something super CS-smart.
* Google used to re-index every month, swapping out indexes. Now it's much more online, and GFS is just a layer supporting a more dynamic layer.
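The bullet about record delimitation deserves a concrete picture. GFS hands applications a huge append-oriented byte stream, so framing, checksumming, and duplicate handling live in userspace. The following is a minimal, hypothetical sketch of that idea; the file name, record format, and helper names are invented and are not the GFS client API.

<syntaxhighlight lang="python">
import struct
import zlib

# Toy illustration of "record delimitation punted to userspace": the
# filesystem only gives us a big append-only byte stream, so the
# application frames its own records (length prefix + checksum) and the
# reader streams sequentially through the whole file, skipping anything
# damaged or truncated. Not real GFS client code.

HEADER = struct.Struct("<II")  # (payload length, CRC32 of payload)

def append_record(f, payload: bytes) -> None:
    """Append one self-delimiting record to an open binary file."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)

def scan_records(f):
    """Stream sequentially through the file, yielding intact records."""
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of stream
        length, crc = HEADER.unpack(header)
        payload = f.read(length)
        if len(payload) < length:
            return  # truncated tail, e.g. a writer crashed mid-append
        if zlib.crc32(payload) == crc:
            yield payload
        # else: corrupted record; the reader simply skips it

if __name__ == "__main__":
    with open("crawl.log", "ab") as f:
        append_record(f, b"crawled page #1")
        append_record(f, b"crawled page #2")
    with open("crawl.log", "rb") as f:
        for record in scan_records(f):
            print(record)
</syntaxhighlight>

The point of the sketch is that the storage layer stays dumb and streaming-friendly, while the small-record bookkeeping a conventional filesystem would do per file is amortized by the application across a few enormous files.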
=== Segue on drives ===
* The structure of GFS does match some other modern systems:
** Hard drives are like parallel tapes, very well suited to streaming.
** Flash devices are log-structured too, but hide it behind abstracting firmware. You want to do erasure in bulk, in the '''background''' (a toy sketch of the log-structured idea follows this list). We used to need specialized filesystems for MTDs to get good performance, though now even embedded systems often have microcontrollers capable enough to abstract away the hardware.
* Architectures that start big often end up in the smallest things.
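To make the log-structured point concrete, here is a minimal, purely hypothetical sketch of the idea: updates never overwrite in place, they always append, and space is reclaimed later in bulk, the way flash firmware erases whole blocks in the background. The class and its structure are invented for illustration, not taken from any real flash translation layer or filesystem.

<syntaxhighlight lang="python">
# Toy model of log-structured writes with deferred, bulk reclamation.
# All names and structures here are invented for illustration.

class LogStore:
    def __init__(self):
        self.log = []      # append-only sequence of (key, value) entries
        self.index = {}    # key -> position of its latest entry

    def write(self, key, value):
        """Updates never overwrite in place; they always append."""
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def read(self, key):
        return self.log[self.index[key]][1]

    def compact(self):
        """Background job: drop stale entries in bulk, the way whole flash
        blocks are erased once their live data has been copied out."""
        live = [(k, v) for pos, (k, v) in enumerate(self.log)
                if self.index[k] == pos]
        self.log, self.index = [], {}
        for k, v in live:
            self.write(k, v)

store = LogStore()
store.write("a", 1)
store.write("a", 2)      # the old entry for "a" is now stale, not erased
store.compact()          # reclamation happens in bulk, in the background
print(store.read("a"))   # -> 2
</syntaxhighlight>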
== How other filesystems compare to GFS and Ceph ==
* Data and metadata are held together.
** This doesn't account for their different access patterns:
*** Data → big, long transfers
*** Metadata → small, low-latency requests
** The two can't be scaled separately (a toy sketch contrasting the two paths follows this list).
* By design, a file is a fraction of the size of a server.
** Huge files spread over many servers are not even in the cards for NFS.
** Meant for small problems, not web scale.
*** Google has a copy of the publicly accessible internet.
**** Their strategy is to copy the internet in order to index it.
**** An insane strategy → an insane filesystem.
* Designed for lower latency.
* Designed for POSIX semantics; reflects how the requirements that led to the ‘standard’ evolved.
* Even mainframes, scale-up solutions, and ultra-reliable systems with data sets bigger than RAM don't have this scale.
* Reliability was a property of the host, not of the network.
* Point-to-point access; much less load balancing, even in AFS.
** Single point of entry, single point of failure, bottleneck.
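The point about coupled data and metadata is easier to see with a sketch. In a GFS- or Ceph-style design the client makes one small, latency-sensitive metadata request and then does its large streaming transfers directly against the data servers, so the two services can be provisioned and scaled independently. Everything below is hypothetical; the class and method names do not correspond to any real NFS, GFS, or Ceph API.

<syntaxhighlight lang="python">
# Hypothetical contrast between the metadata path (small, low latency,
# rare) and the data path (big, long transfers). Invented names only.

class MetadataService:
    """Answers tiny questions: which servers hold this file's chunks?"""
    def __init__(self):
        self.table = {"/web/crawl-0001": ["data-server-3", "data-server-7"]}

    def lookup(self, path):
        return self.table[path]        # small reply, rare operation

class DataServer:
    """Serves big, throughput-oriented streaming reads."""
    def read_chunk(self, path, offset, length):
        return b"\0" * length          # stand-in for a bulk transfer

def read_file(path, meta, servers):
    locations = meta.lookup(path)      # one small metadata round trip...
    chunk_size = 64 * 1024             # toy size; real GFS chunks are 64 MB
    data = bytearray()
    # ...then long streaming transfers straight from the data servers,
    # which can be scaled out independently of the metadata service.
    for i, host in enumerate(locations):
        data += servers[host].read_chunk(path, i * chunk_size, chunk_size)
    return bytes(data)

servers = {"data-server-3": DataServer(), "data-server-7": DataServer()}
print(len(read_file("/web/crawl-0001", MetadataService(), servers)))
</syntaxhighlight>

When data and metadata are served together over the same point-to-point path, as in NFS, there is no such split to exploit: every read loads the same server, which becomes the single point of entry, failure, and congestion noted above.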
==Ceph==
* Ceph is crazy and tries to do everything.
* Unlike GFS, it distributes the metadata, not just read-only copies of it.
* Unlike GFS, the OSDs have some intelligence and autonomously distribute the data, rather than being controlled by a master.
** Uses hashing in the distribution process to '''uniformly''' distribute data.
** The actual algorithm for distributing data is roughly as follows (a sketch appears below this list): <math>(\text{file}, \text{offset}) \rightarrow \text{object ID} \xrightarrow{\text{hash}} \text{placement group} \xrightarrow{\text{CRUSH}} \text{OSDs}</math>
** Each client has knowledge of the entire storage network.
** Tracks failure groups (same breaker, same switch, etc.), hot data, etc.
** The number of replicas is changeable on the fly, but the placement group is not.
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.
** You don't ask where to go, you just go, which makes this very scalable.
* CRUSH is sufficiently advanced to be called magic.
** The placement computation is roughly <math>O(\log n)</math> in the size of the storage cluster.
** CPUs are stupidly fast, so this is minimal overhead, whereas the network, despite being fast, has latency, etc. Computation scales much better than communication.
* Storage is composed of variable-length atoms (objects).
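Here is a minimal sketch of that placement pipeline. The real CRUSH algorithm walks a weighted hierarchy of failure domains from the cluster map; the stand-in below just uses deterministic hashing, so it only illustrates the shape of the computation, not CRUSH itself. All parameters (object size, placement-group count, the cluster map) are invented.

<syntaxhighlight lang="python">
import hashlib

# Sketch of: (file, offset) -> object ID -> hash -> placement group
#            -> CRUSH-like function -> OSDs
# Deterministic and local: any client with the cluster map computes the
# same answer, so nobody has to ask a central master where data lives.

OBJECT_SIZE = 4 * 1024 * 1024                  # bytes per object (assumed)
PG_COUNT = 128                                 # placement groups (assumed)
CLUSTER_MAP = [f"osd.{i}" for i in range(12)]  # known to every client

def object_id(filename, offset):
    """(file, offset) -> object ID: which object holds this byte."""
    return f"{filename}.{offset // OBJECT_SIZE}"

def placement_group(oid):
    """object ID -> placement group, via a uniform hash."""
    digest = hashlib.sha1(oid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % PG_COUNT

def crush_like(pg, replicas=3):
    """placement group -> list of OSDs. Stand-in for CRUSH: deterministic
    and needs only the cluster map; real CRUSH also respects failure
    domains and device weights."""
    chosen, attempt = [], 0
    while len(chosen) < replicas:
        h = hashlib.sha1(f"{pg}:{attempt}".encode()).digest()
        osd = CLUSTER_MAP[int.from_bytes(h[:4], "big") % len(CLUSTER_MAP)]
        if osd not in chosen:
            chosen.append(osd)
        attempt += 1
    return chosen

# Any client, anywhere, computes the same placement without a lookup:
oid = object_id("crawl-2014-02.data", offset=9_000_000)
pg = placement_group(oid)
print(oid, "-> pg", pg, "->", crush_like(pg))
</syntaxhighlight>

Because the mapping depends only on the object name and the shared cluster map, "you don't ask where to go, you just go": a change to the replica count or the cluster map is something every client recomputes locally, not another round trip to a master.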