DistOS 2018F 2018-10-31

From Soma-notes

Readings

Notes

GFS: Why do they want to store stuff?

What is it being used for?

The search engine motivated the design: a copy of the web, for the crawler and the indices built from the crawler. Bring in lots of data and do something with it. The crawler goes out and downloads everything it can find. What kind of file system does that need? Lots of small files? No, not efficient. Instead GFS has an atomic record append. The crawler finds things and puts together a bunch of pages; a record is an application object, i.e., what the crawler found. Atomic append: got it, saved, done. Append semantics matter because the file is not a byte stream, it is application objects appended one after another. If you are crawling along, try to do a write, and there is a problem with the write, you do it again and try a few times. How many times did you write it? It got written at least once. Record append gives no guarantee that something is written only once, so the application level has to detect records that might have been appended multiple times. That makes sense in the context of a very large web crawler.

Enormous files with lots of the internet in them, in one file. Large files inform the design: the chunk size is 64MB (designed for a large web crawler that wants to save everything). Lots of files and metadata? No, and this is important, because they did a lot of work to minimize metadata by design choice. One master server stores all the metadata; you want it to be fast, and it cannot be the bottleneck, so it is loaded up with RAM and runs mostly from memory. Do the metadata lookup at the master, then talk to the chunk servers. Ceph has an entire cluster and dynamically allocates metadata; GFS has one master server with replicas and that's it. Seems dumb, why? Simpler, ridiculously simple. It is not a general-purpose file system and not POSIX compatible; it was developed for a web crawler: 64MB chunks and one metadata server, optimized for atomic append operations on objects that may be replicated. Why don't they use GFS now? It was optimized for batch processing, i.e., build an index and make it available to the search engine, but Google is now oriented completely towards fresh data.
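The at-least-once behavior of record append can be sketched in a few lines. Everything here (ChunkFile, the 30% lost-ack rate, the "id" field) is invented for illustration; only the idea comes from the notes: a retried append can land twice, and the application dedupes.

```python
import random

class ChunkFile:
    """Toy stand-in for a GFS file built from record appends."""
    def __init__(self):
        self.records = []

    def record_append(self, record):
        # The append itself is atomic, but the ack back to the client can
        # be lost, in which case the client retries and the record lands
        # again: at-least-once semantics.
        self.records.append(record)
        if random.random() < 0.3:  # simulate a lost ack
            raise TimeoutError("ack lost, client should retry")

def crawler_append(f, record, max_tries=5):
    # Client-side retry loop: keep appending until an ack comes back.
    for _ in range(max_tries):
        try:
            f.record_append(record)
            return
        except TimeoutError:
            continue

def read_deduped(f):
    # The application detects duplicates itself, here via a unique "id".
    seen, out = set(), []
    for rec in f.records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            out.append(rec)
    return out
```

The reader sees each logical record once even though the file may physically contain duplicates.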
But GFS is an interesting design from the perspective of specialized storage: designed for one specific application, with lots of trade-offs, high performance but limited functionality. If you are doing a read, what is the process? Ask the master first which chunks hold what you are looking for; the master replies with the info on the chunks for that byte range, and from there the client gets the data from the chunk servers as fast as it can digest it. The master server is not involved most of the time. The coarse-grained nature of the chunks (64MB) means one small metadata reply covers a large amount of data to be transmitted, an amplification factor, leverage that enables relative scaling.
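The read path above (one metadata round trip, then direct chunkserver reads) can be sketched as follows. The file name, chunk handles, and server names are made up for the example; only the byte-range-to-chunk mapping reflects the design described.

```python
CHUNK_SIZE = 64 * 2**20  # 64MB chunks

# Toy master metadata: file -> (chunk handle, chunkserver) per 64MB chunk.
MASTER = {
    "/crawl/web-0001": [("chunk-a", "cs1"), ("chunk-b", "cs7"), ("chunk-c", "cs3")],
}

def chunks_for_range(path, offset, length):
    """One round trip to the master maps a byte range to chunk handles and
    locations; after that the client reads from the chunkservers directly,
    with no further master involvement."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return MASTER[path][first:last + 1]
```

A 100-byte read near the start touches one chunk; a read straddling a 64MB boundary comes back with two chunk locations.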

Writes are more complicated. Three is the magic number (3 replicas): send the data out, wait for acks, and maybe re-transmit until done. How big of a buffer do you need? A good size, several GB of RAM. The crawling application uses a lot of systems all writing to the same file; who knows the order in which records will be appended, but order doesn't matter. That is why it is a weird kind of file.
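The write path (push to 3 replicas, wait for acks, re-transmit on failure) might look roughly like this. The Replica class and its ack-dropping knob are toy stand-ins; note how re-transmission is exactly what produces the duplicate records the application must tolerate.

```python
class Replica:
    """One of the 3 copies of a chunk; may drop an ack a few times."""
    def __init__(self, drop_acks=0):
        self.data = []
        self._drop = drop_acks

    def append(self, record):
        self.data.append(record)  # data can land even when the ack is lost
        if self._drop > 0:
            self._drop -= 1
            return False          # ack lost
        return True

def replicated_append(replicas, record, max_rounds=5):
    # Send to every replica and wait for all acks; if any ack is missing,
    # re-transmit the whole append, so a replica that already stored the
    # record ends up with a duplicate.
    for _ in range(max_rounds):
        acks = [r.append(record) for r in replicas]
        if all(acks):
            return True
    return False
```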

Chubby, what is Chubby? A file system: a hierarchical key-value store. Can you store big files? No, 256k, but it is still a file system. Why not use a regular file system, why Chubby? Consistency: it is a consistent distributed file system, where a read and a write will look the same at any time; readers and writers will see the same view. The systems discussed previously were not completely consistent. Chubby has to be, because it is going to be used for locks. Lock files are fine on an individual system, but here you need a distributed file system that is completely consistent to use for locking. That is not easy, kinda crazy. They use Paxos: the same state for everyone. How many failures can you tolerate? A typical cell has 5 replicas because the Paxos rule is 2n+1: if you want to tolerate two failures, you should have 5 machines, which gives a high level of reliability. When GFS wants to elect a master for a cell, it goes to Chubby, finds out who the master is, and starts talking to that master. If the master is down or not responding, ask Chubby, which elects a new master. Chubby is the reliable backbone for electing who is in charge; when a master fails, you can switch over and transmit the new state to everyone quickly. Chubby is how it was done.
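The master-election pattern described above, with Chubby as the source of truth for who is in charge, can be sketched as a lock file. ChubbyCell and the lock-file path are toy stand-ins, not Chubby's real API; the point is that exclusive creation of one small file decides the election, and everyone else just reads the file.

```python
class ChubbyCell:
    """Toy model of master election via a lock file: the first node to
    create the file is the master; everyone else reads it to learn who
    is in charge."""
    def __init__(self):
        self.files = {}

    def try_acquire(self, path, node):
        if path not in self.files:
            self.files[path] = node  # exclusive create wins the election
            return True
        return False

    def read(self, path):
        return self.files.get(path)

    def master_died(self, path):
        # Stands in for the old master's lease expiring.
        self.files.pop(path, None)
```

When the master dies, the file disappears and the next node to acquire it becomes the new master; clients find it by reading the same path.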

Paxos does not scale. Five servers; can you go to 50? No, it is a chatty protocol. Where you need consistency you can get it, but you must pay for it, so the rest of the system must be less consistent. If you force consistency on everything all the time, you never get performance. Paxos is the algorithm Chubby uses to coordinate state in the file system: managing uncertainty when everyone needs the same state. Paxos maintains a replicated data structure and agrees on updates to it. You could always ask Chubby for the current value and get the same value from any of the nodes. There might be a delay ("don't know yet, wait"), but when it gives an answer, it is the correct answer.
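The 2n+1 arithmetic and the chattiness argument can be made concrete. messages_per_round is a rough back-of-the-envelope count (one round trip from the coordinator to every other replica), not the exact Paxos message complexity.

```python
def cell_size(f):
    """Replicas needed to tolerate f failures: 2f + 1, so a majority
    still survives any f crashes."""
    return 2 * f + 1

def majority(n):
    """Quorum size: any two majorities of n overlap in at least one node."""
    return n // 2 + 1

def messages_per_round(n):
    # Rough chattiness estimate: the coordinator must round-trip with
    # every other replica each consensus round, so traffic grows with
    # cell size. This is why 5 works but 50 does not.
    return 2 * (n - 1)
```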

Sequencer in Chubby: consensus on the ordering of states. Who goes first, second, third, and fourth? If you enforce ordering, you will pay a price. You can do this with a lock generation number. Temporal ordering is more difficult than consensus on state.
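The lock generation number idea can be sketched as follows. ChubbyLock and FileServer are hypothetical names invented for the example; the mechanism shown (stamp each request with the generation under which the lock was acquired, and reject stale stamps) is the sequencer idea from the notes.

```python
class ChubbyLock:
    """Sketch of a lock generation number (a 'sequencer'): each time the
    lock changes hands, the generation increments."""
    def __init__(self):
        self.generation = 0
        self.holder = None

    def acquire(self, client):
        self.generation += 1
        self.holder = client
        return self.generation  # the sequencer the client passes along

class FileServer:
    """Hypothetical server that checks the sequencer before acting."""
    def __init__(self, lock):
        self.lock = lock

    def write(self, client, sequencer):
        # A request stamped with an old generation comes from a client
        # that lost the lock without noticing; reject it.
        if sequencer != self.lock.generation:
            raise PermissionError("stale sequencer: lock has moved on")
        return "write by %s accepted" % client
```

If client A acquires the lock, then B acquires it, A's old sequencer no longer matches and A's writes are refused.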

Hadoop and ZooKeeper are the things people built after Google published: a whole set of technologies, not identical to GFS but inspired by it.

How does Hadoop compare to AWS? Amazon has its own version of this stuff. You can go to Amazon, ask for a bunch of VMs, and deploy Hadoop yourself, or AWS can do it for you. What they offer is S3 (the cheapest storage).


More notes

  • GFS
    • Google doesn't use GFS anymore, use something newer
    • Google needed to store lots of sequential data for their crawler and indices. Downloading everything and storing it.
    • Atomic record append operation, crawler download webpage, then just stick it on with record append.
      • No guarantee that only written once. Application level has to verify uniqueness, makes perfect sense in the context of giant web crawler.
    • Enormous files with huge parts of internet.
      • Reflected in 64MB chunk size
      • Not a lot of metadata, not a lot of files, allows you to store it all on 1 metadata server. Can load it up with RAM and it will run mostly from memory.
    • FAR, FAR simpler than Ceph; Ceph has many metadata servers.
    • SIMPLICITY: Need something to store results of web crawler.
    • GFS optimized for batch processing, but now need fresh data.
    • Specialized design allows you to make more tradeoffs, high performance and simple
    • If you're doing a read for a file, you ask the master what chunks it's stored in and the master server gives you the list of chunks.
    • Optimized for appends, don't expect things to change.
    • Writes much more complicated and slower: data goes to all 3 replicas, wait for acks, re-transmit as needed.
    • Crawling application has a bunch of systems writing to a single file, order doesn't matter.
  • Chubby
    • GFS uses chubby for master election
    • Hierarchical key value store
    • Key idea is a consistent distributed file system, consistent at a fine level of granularity: it's going to look the same at any time, all readers and writers will see the same view. Nothing else in the course is as consistent.
    • Idea of Chubby is you want a completely consistent distributed file system so you can use it for locking
    • Use Paxos: an entire research area aimed at distributed consensus.
      • You have distributed nodes, you want the same state, how many failures can you tolerate.
      • Typical Chubby cluster is 5 nodes: 2n+1, if you want to tolerate n = 2 failures then 2*2+1 = 5.
      • Paxos does not scale very well; the protocol is too chatty. You cannot force consistency on everything all the time, or you will not get performance.
      • “Ok everyone im writing to this, wait … ok i'm done now let's go”
    • Chubby provides a reliable backplane for saying who's in charge; you just want to know who the master is. Nodes ask Chubby who the master node is, and if the master is no longer alive, a node can become the new master this way.
  • Hadoop is a public implementation of GFS and Chubby, started by Yahoo, then Facebook, now many others.
    • Apache ZooKeeper is basically like Chubby.
    • Apache Hadoop Distributed File System
    • Relation to AWS: Amazon has their own version of everything and does it all for you; AWS provides higher-level infrastructure.
      • S3 is a data store, very cheap; NFS-style or block-storage file systems are far more expensive. You do not know how S3 operates internally.