DistOS 2014W Lecture 10

Context

GFS

  • Very different because of the workload that it is designed for.
    • Because of the huge number of small files (web pages to be indexed, etc.), it is no longer practical for the filesystem to store them individually; the per-file overhead is too high. GFS punts the problem to userspace, including record delimitation.
  • Doesn't care about latency, which is surprising considering it's Google, the same company that pushed to increase the TCP initial window (IW) standard recommendation to reduce latency.
  • Mostly seeking through entire file.
  • Paper from 2003, mentions still using 100BASE-T links.
  • Data-heavy, metadata-light: contacting the metadata server is a rare event (see the read-path sketch after this list).
  • Really good that they designed for unreliable hardware:
    • All the replication
    • Data checksumming
  • Performance degrades for small random access workload; use other filesystem.
  • Path of least resistance to scale, not to do something super CS-smart.
  • Google used to re-index every month, swapping out indexes. Now, it's much more online. GFS is now just a layer to support a more dynamic layer.
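
A minimal sketch of that read path, with hypothetical names (Master, ChunkServer, client_read; the real interfaces are internal to Google): the client makes one small metadata call, then streams the bulk data directly from a chunkserver.

<syntaxhighlight lang="python">
# Toy sketch of a GFS-style read path. The master is metadata-only and is
# contacted rarely; the large transfer goes straight to a chunkserver.
# All names here are illustrative, not Google's actual API.

CHUNK_SIZE = 64 * 2**20  # GFS uses fixed 64 MB chunks


class Master:
    """Metadata only: maps (path, chunk index) -> (chunk handle, replica locations)."""

    def __init__(self):
        self.chunks = {}

    def lookup(self, path, chunk_index):
        return self.chunks[(path, chunk_index)]


class ChunkServer:
    """Holds chunk contents keyed by handle and serves large streaming reads."""

    def __init__(self):
        self.store = {}

    def read(self, handle, offset, length):
        return self.store[handle][offset:offset + length]


def client_read(master, servers, path, offset, length):
    # One small, rare metadata RPC to the single master ...
    handle, replicas = master.lookup(path, offset // CHUNK_SIZE)
    # ... then the big data transfer goes directly to one of the replicas.
    return servers[replicas[0]].read(handle, offset % CHUNK_SIZE, length)
</syntaxhighlight>

A read that crosses a chunk boundary would repeat the lookup for the next chunk; the point is only that metadata traffic stays tiny relative to data traffic.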

Segue on drives

  • Structure of GFS does match some other modern systems:
    • Hard drives are like parallel tapes, very suited for streaming.
    • Flash devices are log-structured too, but hide it behind abstracting firmware: you want to do erasure in bulk, in the background (see the sketch after this list). We used to need specialized filesystems for MTDs to get decent performance; now even embedded systems often have microcontrollers that abstract the hardware away.
  • Architectures that start big, often end up in the smallest things.
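
The log-structured point can be made concrete with a toy append-only store (purely illustrative; a real flash translation layer is far more involved): writes only ever append, reads go through an index to the latest copy, and stale space is reclaimed later, in bulk, by a background cleaner.

<syntaxhighlight lang="python">
# Toy append-only (log-structured) store: never overwrite in place,
# reclaim stale records later in bulk with a background cleaner.

class LogStore:
    def __init__(self):
        self.log = []      # append-only log of (key, value) records
        self.latest = {}   # key -> index of its most recent record

    def write(self, key, value):
        self.latest[key] = len(self.log)
        self.log.append((key, value))   # appends only, like streaming to tape

    def read(self, key):
        return self.log[self.latest[key]][1]

    def clean(self):
        """Background cleaning: copy live records forward, drop stale ones in bulk."""
        self.log = [rec for i, rec in enumerate(self.log) if self.latest[rec[0]] == i]
        self.latest = {key: i for i, (key, _) in enumerate(self.log)}
</syntaxhighlight>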

How other filesystems compare to GFS and Ceph

  • Data and metadata are held together.
    • Doesn't account for different access patterns:
      • Data → big, long transfers
      • Metadata → small, low latency
    • Can't scale separately
  • By design, a file is a fraction of the size of a server
    • Huge files spread over many servers are not even in the cards for NFS
    • Meant for small problems, not web-scale
      • Google has a copy of the publicly accessible internet
        • Their strategy is to copy the internet to index it
        • Insane → insane filesystem
  • Designed for lower latency
  • Designed for POSIX semantics, reflecting how the requirements that led to the ‘standard’ evolved
  • Even mainframes (scale-up, ultra-reliable systems with data sets bigger than RAM) don't operate at this scale.
  • Reliability was a property of the host, not the network
  • Point-to-point access; much less load-balancing, even in AFS
    • Single point of entry, single point of failure, bottleneck
  • No notion of data replication.

Ceph

  • Ceph is crazy and tries to do everything
  • Unlike GFS, Ceph distributes the metadata itself, not just read-only copies of it
  • Unlike GFS, the OSDs have some intelligence, and autonomously distribute the data, rather than being controlled by a master.
    • Uses hashing in the distribution process to uniformly distribute data
    • The actual algorithm for distributing data is as follows (see the sketch after this list):

      <math>(\text{file}, \text{offset}) \rightarrow \text{object ID} \xrightarrow{\mathrm{hash}} \text{placement group} \xrightarrow{\mathrm{CRUSH}} \text{OSD}</math>

    • Each client has knowledge of the entire storage network.
    • Tracks failure groups (same breaker, switch, etc.), hot data, etc.
    • Number of replicas is changeable on the fly, but the placement group is not
      • For example, if every client on the planet is accessing the same file, you can scale out for that data.
    • You don't ask where to go, you just go, which makes this very scalable
  • CRUSH is sufficiently advanced to be called magic.
    • <math>O(\log n)</math> in the size of the data
    • CPUs are stupidly fast, so the above adds minimal overhead, whereas the network, despite being fast, still has latency; computation scales much better than communication.
  • Storage is composed of variable-length atoms
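
The mapping above can be sketched in a few lines. The crush() function below is only a stand-in: it uses rendezvous (highest-random-weight) hashing rather than the real CRUSH algorithm, which descends a weighted hierarchy of buckets and failure domains, but it shows the key property that every client computes the same placement locally instead of asking a lookup server.

<syntaxhighlight lang="python">
# Stand-in for Ceph's placement pipeline: (file, offset) -> object ID ->
# hash -> placement group -> CRUSH -> OSDs. crush() here is rendezvous
# hashing, NOT the real CRUSH algorithm; sizes and names are illustrative.

import hashlib

OBJECT_SIZE = 4 * 2**20   # files are striped into objects of a few MB
PG_COUNT = 128            # number of placement groups (illustrative)


def h(*parts):
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def object_id(path, offset):
    return f"{path}.{offset // OBJECT_SIZE}"


def placement_group(oid):
    return h(oid) % PG_COUNT


def crush(pg, osds, replicas=3):
    """Deterministically rank the OSDs for a placement group and take the top few."""
    return sorted(osds, key=lambda osd: h(pg, osd), reverse=True)[:replicas]


# Any client anywhere runs the same computation and gets the same answer:
osds = [f"osd.{i}" for i in range(12)]
pg = placement_group(object_id("/crawl/2014-02.log", 9_000_000))
print(pg, crush(pg, osds))
</syntaxhighlight>

Note how changing the replica count only changes how many entries of the ranked list are used, which lines up with the replica count being changeable on the fly.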

Additional notes from the class discussion:

In Anil’s opinion, “how does the file system size compare to the server’s storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the early file systems NFS, AFS, and Plan 9. In the early designs, the file system was a fraction of the server’s storage size, whereas in GFS and Ceph the file system can be orders of magnitude larger than any single server. One key aspect of the Ceph design is the attempt to replace communication with computation, using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in computer science: “If one abstraction does not work, stick another one in.”