EvoSec 2025W Lecture 6

From Soma-notes
Lecture 6
---------
 - GFS & Chubby
 - trust & security
 - projects, set a schedule

What is the problem they are solving? (Why were these built?)
 - for indexing the web!
 - i.e., download a copy of the web and process it

 - many web crawlers grabbing pages, images, etc., all needing somewhere to store them

Only way to make this work is to have LOTS of computers storing LOTS of data in parallel
 - how to coordinate?

So in GFS, what is a file?
 - not a regular UNIX file!
 - files are built by appending records
   - each record must have a unique ID
   - because retried writes mean the same record can end up in the file multiple times!
   

If there are any errors, operations are retried
 - so writes have "at least once" semantics
 - if you want "exactly once" semantics, you have to go and check what happened
   to a failed write yourself, which slows things down
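The append-and-retry behavior above can be sketched in a few lines. This is a toy model, not the real GFS API: the "chunk" is just an in-memory list, and the record IDs and failure flag are illustrative. The point is that retries duplicate records, and readers deduplicate by record ID.

```python
# Sketch: at-least-once record appends with reader-side deduplication.
import uuid

chunk = []  # stand-in for a GFS chunk: an append-only list of records

def append_record(payload, fail_first_try=False):
    """Append a record; if the ack is (simulated as) lost, the client retries,
    so the same record lands in the chunk twice."""
    record_id = str(uuid.uuid4())
    chunk.append((record_id, payload))        # first attempt lands...
    if fail_first_try:                        # ...but the ack is lost,
        chunk.append((record_id, payload))    # so the retry duplicates it
    return record_id

def read_records():
    """Readers deduplicate by record ID, keeping the first occurrence."""
    seen, result = set(), []
    for rid, payload in chunk:
        if rid not in seen:
            seen.add(rid)
            result.append(payload)
    return result

append_record("crawl: page A")
append_record("crawl: page B", fail_first_try=True)   # physically written twice
assert len(chunk) == 3
assert read_records() == ["crawl: page A", "crawl: page B"]
```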

Master vs chunk servers
 - master stores metadata, chunk servers store data

So master server knows about chunk servers
 - guides replication process when chunk servers stop responding
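The master/chunkserver split can be sketched as follows. All the names and data here are made up; the idea is just that the master holds only metadata (file → chunks, chunk → replica locations) while clients fetch the actual bytes directly from chunk servers.

```python
# Sketch of the master/chunkserver split. Dicts stand in for servers.

master_metadata = {
    "/crawl/pages-0001": ["chunk-17", "chunk-18"],   # file -> chunk IDs (master)
}
chunk_locations = {
    "chunk-17": ["cs-a", "cs-b", "cs-c"],            # chunk -> replicas (master)
    "chunk-18": ["cs-b", "cs-d", "cs-e"],
}
chunk_servers = {                                    # server -> chunk data
    "cs-a": {"chunk-17": b"<html>..."},
    "cs-b": {"chunk-17": b"<html>...", "chunk-18": b"<img>..."},
    "cs-d": {"chunk-18": b"<img>..."},
}

def read_file(path):
    data = b""
    for chunk_id in master_metadata[path]:           # 1. ask master for layout
        for server in chunk_locations[chunk_id]:     # 2. try replicas in order
            if chunk_id in chunk_servers.get(server, {}):
                data += chunk_servers[server][chunk_id]  # 3. read from chunkserver
                break                                # some replicas may be down
    return data

assert read_file("/crawl/pages-0001") == b"<html>...<img>..."
```

Note that "cs-c" and "cs-e" hold no data here, modeling replicas that have stopped responding; the client just moves on to the next replica, and the real master would re-replicate those chunks.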

But what happens if the master server goes down?
 - there are backup masters (shadows) that can take over at a moment's notice

But how does a client know which master is the right master?
 - that's where chubby comes in!

Chubby is about coordination, not storing data
 - it approximates the data consistency of a single computer
   using multiple computers
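One way Chubby answers the "which master is the right master?" question can be sketched like this. The lock names, file paths, and dict-based "lock service" are all illustrative: candidate masters race to acquire a well-known lock, the winner advertises its address in a small file, and clients simply read that file.

```python
# Toy sketch of Chubby-style master election. A dict stands in for a
# replicated Chubby cell; paths and names are made up for illustration.

lock_service = {"locks": {}, "files": {}}

def try_become_master(candidate):
    """Try to acquire the master lock; on success, advertise our address."""
    if "gfs-master" not in lock_service["locks"]:
        lock_service["locks"]["gfs-master"] = candidate
        lock_service["files"]["/ls/cell/gfs-master"] = candidate
        return True
    return False          # someone else already holds the lock

def find_master():
    """Clients find the current master by reading the well-known file."""
    return lock_service["files"].get("/ls/cell/gfs-master")

assert try_become_master("master-1") is True
assert try_become_master("shadow-2") is False   # lock already held
assert find_master() == "master-1"
```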

Paxos is a consensus algorithm
 - it gets a group of machines to agree on the same value/state,
   even when some of them fail
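A heavily compressed sketch of single-decree Paxos (the two phases: prepare/promise, then accept/accepted). Real Paxos runs over an unreliable network with retries and competing proposers; here everything is a synchronous function call, which is only meant to show why a later proposal cannot overwrite an already-chosen value.

```python
# Minimal single-decree Paxos sketch (synchronous, failure-free for brevity).

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted = None      # (number, value) of last accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    # Phase 1: gather promises from a majority of acceptors.
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for tag, acc in replies if tag == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None               # no majority: proposal fails
    # If any acceptor already accepted a value, we must re-propose it.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior)[1]     # value of the highest-numbered acceptance
    # Phase 2: ask acceptors to accept the (possibly inherited) value.
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
assert propose(acceptors, 1, "master-1") == "master-1"
# A later, competing proposal learns the already-chosen value instead
# of overwriting it:
assert propose(acceptors, 2, "master-2") == "master-1"
```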

Individual computers today are highly inconsistent inside
 - lots of concurrency
 - but the hardware people take care of that, making it appear consistent
   e.g., a variable always has the same value no matter when it is read
   - well, not really: if you want consistent data under concurrency, you need
     locks (semaphores, etc.) built on atomic instructions (test-and-set, swap)
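The lock-from-atomic-instruction idea can be sketched like so. On real hardware, test-and-set is a single atomic instruction; since Python has no such primitive, this sketch fakes the atomicity with a `threading.Lock`, which is admittedly circular but keeps the example runnable.

```python
# Sketch: building mutual exclusion on a test-and-set primitive.
import threading

class TestAndSetFlag:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag to True and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def clear(self):
        with self._atomic:
            self._flag = False

class SpinLock:
    def __init__(self):
        self._flag = TestAndSetFlag()

    def acquire(self):
        while self._flag.test_and_set():  # spin until we flip False -> True
            pass

    def release(self):
        self._flag.clear()

counter = 0
lock = SpinLock()

def bump(n):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1                      # critical section
        lock.release()

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 40_000                  # no lost updates
```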

GFS & chubby: Centralized, hierarchical trust
 - assume every system is behaving non-maliciously
 - but we're all happy servers in a Google server farm!

If Google assumed their servers were all untrustworthy, the system would be MUCH less efficient
 - and in general, you can't eliminate trust entirely anyway

By assuming more trust, you can get higher efficiency
 - because it is easier to coordinate activities

Project schedule:
  2% areas of interest         Jan 31
  3% elevator pitch            Feb 13 (1 min blurb orally, 1-3 slides)
 20% early lit review          March 3
 10% tests/preliminary work    March 24
  5% presentation              last two weeks (April 1, 3, 8)
 30% research proposal/paper   end of exam period ("take-home exam")


What should be your project?
 - combine your interests with what we've discussed