Notes
Lecture 10
----------
BigTable & MapReduce
Questions
BigTable
- single row transactions
- sparse semi-structured data, what do they mean?
MapReduce
- what happens when the master goes down?
- restarting tasks seems wasteful, why do that?
- single master, failure unlikely how?!
What's the problem MapReduce is trying to solve?
- how to process *large* amounts of data quickly
- really, summarize data
- think of this as web crawl data
example problem: what's the frequency of words on the entire web (as crawled by google?)
What's the strategy?
- map: analyze data in small chunks in parallel,
i.e., run a program on each bit of data separately
(example: have a program calculate word frequencies in
a web page)
- reduce: combine results to summarize
- must be an operation that doesn't depend on order
of operations
- get a list, produce a smaller list where every element is
of the same type as before
- example: combine word frequencies for individual pages
to get aggregate word frequencies
When fully reduced, you have word frequencies for the entire web.
Can every problem be expressed as a map and a reduce?
- no!
Can a lot of problems that people care about be expressed this way?
- yes!
map is embarassingly parallel
- so can be maximally sped up
- think of this as extracting the features you care about
reduce can be slow, but normally it shouldn't be that hard
- can't have order dependency though
- here we process the features, but in a non-order-dependent way
Why do pieces "start over" when a node crashes during a mapreduce operation?
- we don't have access to the local state that was computed
- local storage is way faster for this sort of thing!
- and even if you did, you can't trust that state
- because it could have been messed up for a while
What about master crashing?
- if it is a long running job, the app developer should
do their own checkpointing
- if it isn't, we don't care!
- and we can generally scale the machines used
(at a place like google) to make individual runs pretty quick
MapReduce is everywhere now
- sometimes too much, it can be used when the paradigm
doesn't really fit the task
- anything that has more data dependencies, i.e. depends on
order of execution
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
- "dwarves" of parallel computation
- they talk about what can be parallized and what can't
- most of what we talk about in this class is embararringly parallel
- except for the parts that aren't
BigTable
- "database" that makes use of parallel storage, i.e., GFS
- overall very parallel, but certain parts are not
- what is very centralized?
- look at Figure 4, most data is in SSTables stored in GFS, but the root isn't, it is in Chubby
Chubby gives robust config information
- the state of BigTable at a high level
- has to be very centralized, very available
Note that Chubby *doesn't* scale well, it only scales a bit
- like 5 machines
- but it scales enough to do what it needs to
- it is the part of the state of the system that
you can NEVER allow to be inconsistent
Parallel systems are hard because they can get partitioned in various ways and state can become inconsistent
- and communication to maintain consistency goes up exponentially with size if done naively
Chubby is a fully connected graph amongst its nodes
- all messages are seen by all nodes
But efficient parallel systems can't be fully connected
- way, way too much communication
How does BigTable handle mutable state?
- can you update arbitrary entries? not directly!
- data is immutable
- if you want to change something, you have to write it to a
new table and eventually consolidate to get rid of old data
- a bit like a log-structured filesystem
immutability is a very powerful idea in distributed systems
- if state can't change, you never have to communicate updates
Does this architecture have privacy implications?
- deletes don't immediately delete data
- modern solution: delete encryption key