Lecture 6
---------
- GFS & Chubby
- trust & security
- projects, set a schedule
What is the problem they are solving? (Why were these built?)
- for indexing the web!
- i.e., download a copy of the web and process it
- many web crawlers grabbing pages, images, etc and needing somewhere to store them
Only way to make this work is to have LOTS of computers storing LOTS of data in parallel
- how to coordinate?
So in GFS, what is a file?
- not a regular UNIX file!
- files are made by appending records
- each record must have a unique ID
- because records can be written multiple times!
If there are any errors, operations are retried
- so the same record can end up in the file more than once
- if you want "only once" semantics, you have to go and check what happened to a failed write, which slows things down
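The retry behaviour above can be sketched in a few lines. This is a toy model, not the GFS client API: `AppendOnlyFile`, `append_with_retry`, and `read_unique` are hypothetical names, and the "lost acknowledgement" is simulated with a flag. It shows why unique record IDs matter: retries leave duplicates, and the reader filters them out.

```python
import uuid

class AppendOnlyFile:
    """Toy stand-in for a GFS file: a list of (record_id, payload) records."""
    def __init__(self):
        self.records = []

    def append(self, record_id, payload, fail_after_write=False):
        self.records.append((record_id, payload))
        if fail_after_write:
            # The data landed, but the acknowledgement was lost (simulated).
            raise IOError("ack lost")

def append_with_retry(f, payload, failures=0):
    """Retry until acknowledged; may leave duplicate records behind."""
    record_id = str(uuid.uuid4())  # same ID is reused on every retry
    attempt = 0
    while True:
        try:
            f.append(record_id, payload, fail_after_write=attempt < failures)
            return record_id
        except IOError:
            attempt += 1

def read_unique(f):
    """Reader-side dedup: keep the first copy of each record ID."""
    seen = set()
    out = []
    for record_id, payload in f.records:
        if record_id not in seen:
            seen.add(record_id)
            out.append(payload)
    return out

f = AppendOnlyFile()
append_with_retry(f, "page-a", failures=2)  # two lost acks, then success
append_with_retry(f, "page-b")
print(len(f.records))   # 4 (three copies of page-a, one of page-b)
print(read_unique(f))   # ['page-a', 'page-b']
```

The writer never checks what happened to a failed append; it just retries, and the cost of duplicates is pushed to the reader.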
Master vs chunk servers
- master stores metadata, chunk servers store data
So master server knows about chunk servers
- guides replication process when chunk servers stop responding
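A rough sketch of how a master might notice silent chunk servers and plan new copies. Everything here is an assumption for illustration: the `Master` class, the heartbeat timeout, and the replication target of 3 are not taken from these notes, and a real master would also handle placement, load, and chunk versions.

```python
REPLICATION_TARGET = 3     # assumed number of copies per chunk
HEARTBEAT_TIMEOUT = 10.0   # assumed seconds of silence before a server counts as dead

class Master:
    """Holds only metadata: who is alive and which server has which chunk."""
    def __init__(self):
        self.last_seen = {}        # server -> time of last heartbeat
        self.chunk_locations = {}  # chunk_id -> set of servers holding it

    def heartbeat(self, server, chunk_ids, now):
        self.last_seen[server] = now
        for chunk_id in chunk_ids:
            self.chunk_locations.setdefault(chunk_id, set()).add(server)

    def live_servers(self, now):
        return {s for s, t in self.last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT}

    def plan_rereplication(self, now):
        """Return (chunk_id, source, target) copies needed to reach the target."""
        live = self.live_servers(now)
        plan = []
        for chunk_id, holders in sorted(self.chunk_locations.items()):
            alive = holders & live
            missing = REPLICATION_TARGET - len(alive)
            if alive and missing > 0:
                source = sorted(alive)[0]
                for target in sorted(live - alive)[:missing]:
                    plan.append((chunk_id, source, target))
        return plan

m = Master()
for server in ("cs1", "cs2", "cs3"):
    m.heartbeat(server, ["c1"], now=0.0)
m.heartbeat("cs4", [], now=0.0)
# cs3 goes silent; the others keep reporting.
for server, chunks in (("cs1", ["c1"]), ("cs2", ["c1"]), ("cs4", [])):
    m.heartbeat(server, chunks, now=15.0)
plan = m.plan_rereplication(now=20.0)
print(plan)  # [('c1', 'cs1', 'cs4')]
```

Note that the master only decides what to copy; the data itself would move directly between chunk servers.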
But what happens if the master server goes down?
- there are backup masters (shadows); in the GFS paper these serve read-only requests while a new master is started from the replicated operation log
But how does a client know which master is the right master?
- that's where Chubby comes in!
Chubby is about coordination, not storing data
- Chubby is about approximating the consistency of a single computer with multiple computers
- in terms of data consistency
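To make "which master is the right master?" concrete, here is a toy lock service used for master election. It is an in-process stand-in, not Chubby's actual API: `LockService`, `try_acquire`, `release`, and `holder` are hypothetical names, and the real service is itself replicated and uses sessions and leases, none of which is modelled here.

```python
class LockService:
    """Toy coordination service: one named lock, one holder at a time."""
    def __init__(self):
        self._holders = {}  # lock name -> current holder

    def try_acquire(self, lock_name, candidate):
        # The first candidate to ask wins; later candidates are refused.
        return self._holders.setdefault(lock_name, candidate) == candidate

    def release(self, lock_name, candidate):
        # Only the current holder can release the lock.
        if self._holders.get(lock_name) == candidate:
            del self._holders[lock_name]

    def holder(self, lock_name):
        # Clients call this to find out who the current master is.
        return self._holders.get(lock_name)

svc = LockService()
print(svc.try_acquire("gfs-master", "master-a"))  # True
print(svc.try_acquire("gfs-master", "master-b"))  # False
print(svc.holder("gfs-master"))                   # master-a

svc.release("gfs-master", "master-a")             # master-a steps down
print(svc.try_acquire("gfs-master", "master-b"))  # True
print(svc.holder("gfs-master"))                   # master-b
```

Because every client asks the same service, they all get the same answer, which is the "single computer" consistency the notes describe.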
Paxos is a consensus algorithm
- makes sure all replicas agree on the same value (and so the same sequence of state changes), even if some of them fail
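A minimal sketch of single-decree Paxos (agreeing on one value), assuming synchronous in-memory calls and no crashes or lost messages. The `Acceptor` class and `propose` function are illustrative names; a real implementation needs networking, persistence, and unique proposal numbers per proposer.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1         # highest proposal number promised so far
        self.accepted_n = -1       # number of the proposal accepted, if any
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    """Run one proposal; return the chosen value, or None if it failed."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in replies if ok]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of our own value.
    prior_n, prior_value = max(granted, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "master-a")
second = propose(acceptors, 2, "master-b")
stale = propose(acceptors, 1, "master-c")
print(first)   # master-a
print(second)  # master-a (the earlier choice is preserved)
print(stale)   # None (proposal number is too old)
```

The key property is visible in the second call: once a value is chosen, later proposers end up re-proposing it rather than overwriting it.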
Individual computers today are highly inconsistent inside
- lots of concurrency
- but hardware people take care of that, make it appear to be consistent
e.g., a variable always has the same value no matter when it is read
- well, not really: if you want consistent data under concurrency you need to use locks (semaphores etc.) built on atomic hardware mechanisms (test-and-set, swap instructions)
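A small sketch of the single-machine case, using Python's `threading.Lock`. Python does not expose test-and-set directly, so this only shows the lock level; the function name `count_with_lock` and the worker counts are arbitrary choices for the example.

```python
import threading

def count_with_lock(workers=4, increments=10_000):
    """Several threads increment one shared counter under a lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments):
            # The read-modify-write below is not atomic on its own,
            # so the lock makes each increment happen one at a time.
            with lock:
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(count_with_lock())  # 40000
```

Without the lock, two threads could read the same old value and one increment would be lost; the lock is what makes the counter look consistent.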
GFS & Chubby: Centralized, hierarchical trust
- assume every system is behaving non-maliciously
- but we're all happy servers in a Google server farm!
If Google assumed their servers were all untrustworthy, the system would be MUCH less efficient
- and in general, you can't even build a system that trusts nothing
By assuming more trust, you can get higher efficiency
- because it is easier to coordinate activities
Project schedule
- 2% areas of interest: Jan 31
- 3% elevator pitch: Feb 13 (1 min blurb orally, 1-3 slides)
- 20% early lit review: March 3
- 10% tests/preliminary work: March 24
- 5% presentation: last two weeks (April 1, 3, 8)
- 30% research proposal/paper: end of exam period ("take home exam")
What should be your project?
- combine your interests with what we've discussed