DistOS 2014W Lecture 10

==Context==

==GFS==

* Very different because of the workload it is designed for.
** Because of the sheer number of small files that have to be indexed for the web, it is no longer practical for the filesystem to store them individually: too much overhead. The problem is punted to userspace, including record delimitation.
* Latency is not a priority, which is surprising coming from Google, the same people who pushed to change the TCP initial-window standard recommendations for the sake of latency.
* Workloads mostly stream through entire files.
* The paper is from 2003 and mentions still using 100BASE-T links.
* Data-heavy, metadata-light: contacting the metadata server is a rare event (see the sketch after this list).
* Really good that they designed for unreliable hardware:
** All the replication
** Data checksumming
* Performance degrades under small random-access workloads; use another filesystem for those.
* The design is the path of least resistance to scale, not an attempt to do something super CS-smart.
* Google used to re-index every month, swapping out the indexes. Now indexing is much more online, and GFS is just a layer supporting a more dynamic layer above it.
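A minimal sketch of why the metadata traffic stays so light. This is not the real GFS client API; the <code>master.lookup()</code> and <code>chunkserver.read()</code> stubs are made-up names. The point is that the client only asks the master for chunk locations, caches them, and streams the bulk data directly from the chunkservers.

<syntaxhighlight lang="python">
CHUNK_SIZE = 64 * 2**20  # GFS chunks are 64 MB

class Client:
    def __init__(self, master, chunkservers):
        self.master = master              # hypothetical master stub
        self.chunkservers = chunkservers  # hypothetical chunkserver stubs
        self.locations = {}               # (path, chunk index) -> replica list

    def read(self, path, offset, length):
        """Read `length` bytes at `offset`, one chunk at a time."""
        data = b""
        while length > 0:
            index = offset // CHUNK_SIZE
            if (path, index) not in self.locations:
                # The only metadata traffic: a tiny lookup, then cached.
                self.locations[(path, index)] = self.master.lookup(path, index)
            replicas = self.locations[(path, index)]
            start = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - start)
            # Bulk data flows straight from a chunkserver, never via the master.
            data += self.chunkservers[replicas[0]].read(path, index, start, n)
            offset, length = offset + n, length - n
        return data
</syntaxhighlight>

With large streaming reads, one cached lookup covers 64 MB of data transfer, which is why the master sees so little traffic.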
 
=== Segue on drives ===
 
* Structure of GFS does match some other modern systems:
** Hard drives are like parallel tapes, very suited for streaming.
** Flash devices are log-structured too, but hide it behind abstracting firmware. You want to do erasure in bulk, in the '''background''' (see the sketch after this list). We used to need specialized filesystems for MTDs to get decent performance; now better microcontrollers in embedded systems abstract the hardware away.
* Architectures that start big often end up in the smallest things.
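A toy sketch of the log-structured idea, not real flash firmware or any particular filesystem (all names here are invented): writes only ever append to the current erase block, an overwrite just leaves a stale copy behind, and reclamation happens later as whole-block erases in the background.

<syntaxhighlight lang="python">
from collections import defaultdict, deque

PAGES_PER_BLOCK = 4  # tiny, for illustration

class LogStore:
    def __init__(self, blocks=8):
        self.free = deque(range(blocks))       # blank erase blocks
        self.pages = defaultdict(list)         # block -> [(key, value), ...]
        self.latest = {}                       # key -> block holding live copy
        self.current = self.free.popleft()

    def write(self, key, value):
        if len(self.pages[self.current]) == PAGES_PER_BLOCK:
            self.current = self.free.popleft() # append-only: open a new block
        self.pages[self.current].append((key, value))
        self.latest[key] = self.current        # any older copy is now stale

    def background_erase(self):
        """Reclaim, in bulk, every closed block that holds no live data."""
        live = set(self.latest.values())
        for block in list(self.pages):
            if block != self.current and block not in live:
                del self.pages[block]          # one cheap whole-block erase
                self.free.append(block)
</syntaxhighlight>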
 
== How other filesystems compare to GFS and Ceph ==
 
* Data and metadata are held together.
** Doesn't account for different access patterns:
*** Data → big, long transfers
*** Metadata → small, low latency
** Can't scale separately
* By design, a file is a fraction of the size of a server
** Huge files spread over many servers are not even in the cards for NFS
** Meant for small problems, not web-scale
*** Google has a copy of the publicly accessible internet
**** Their strategy is to copy the internet to index it
**** Insane → insane filesystem
* Designed for lower latency
* Designed for POSIX semantics; a reflection of how the requirements that led to the ‘standard’ evolved
* Even mainframes (scale-up, ultra-reliable systems with data sets bigger than RAM) don't operate at this scale.
* Reliability was a property of the host, not the network
* Point-to-point access; much less load-balancing, even in AFS
** Single point of entry, single point of failure, bottleneck


==Ceph==
* Ceph is crazy and tries to do everything
* Unlike GFS, it distributes the metadata, not just read-only copies of it
* Unlike GFS, the OSDs have some intelligence and autonomously distribute the data, rather than being controlled by a master.
** Uses hashing in the distribution process to '''uniformly''' distribute data
** The actual pipeline for placing data is as follows (see the sketch after this list):
**: <code>(file, offset) → object ID → hash(object ID) → placement group → CRUSH(placement group) → OSDs</code>
** Each client has knowledge of the entire storage network.
** Tracks failure groups (same breaker, switch, etc.), hot data, etc.
** The number of replicas is changeable on the fly, but the placement group is not.
*** For example, if every client on the planet is accessing the same file, you can scale out for that data.
** You don't ask where to go, you just go, which makes this very scalable.
* CRUSH is sufficiently advanced to be called magic.
** <math>O(\log n)</math> in the size of the data
** CPUs are stupidly fast, so this computation adds minimal overhead, whereas the network, despite being fast, still has latency. Computation scales much better than communication.
* Storage is composed of variable-length atoms
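A minimal sketch of that pipeline, assuming a hypothetical pool with fixed-size objects. The <code>crush_like()</code> stand-in simply ranks OSDs by a deterministic per-placement-group hash rather than walking a weighted cluster map the way real CRUSH does, but it shows the shape of the mapping and why any client can compute a location without asking a server.

<syntaxhighlight lang="python">
import hashlib

PG_COUNT = 128           # placement groups in this hypothetical pool
OSD_COUNT = 12           # object storage devices in this hypothetical cluster
REPLICAS = 3
OBJECT_SIZE = 4 * 2**20  # files are striped into fixed-size objects

def object_id(file_ino: int, offset: int) -> str:
    """(file, offset) → object ID: which stripe of the file holds this byte."""
    return f"{file_ino}.{offset // OBJECT_SIZE}"

def placement_group(oid: str) -> int:
    """hash(object ID) → placement group."""
    h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:8], "big")
    return h % PG_COUNT

def crush_like(pg: int) -> list:
    """CRUSH(placement group) → OSDs. Stand-in: rank OSDs by a per-PG hash."""
    ranked = sorted(range(OSD_COUNT),
                    key=lambda osd: hashlib.sha1(f"{pg}:{osd}".encode()).digest())
    return ranked[:REPLICAS]

# Any client computes the same answer locally; no metadata server is consulted.
oid = object_id(file_ino=42, offset=10 * 2**20)
pg = placement_group(oid)
print(oid, "→ PG", pg, "→ OSDs", crush_like(pg))
</syntaxhighlight>

Because every step is a pure function of the object name and the (rarely changing) cluster map, adding clients adds no load on any central location service, which is the "you don't ask where to go, you just go" point above.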
