==Notes==

<pre>
Lecture 10
----------
BigTable & MapReduce

Questions

BigTable
- single row transactions
- sparse semi-structured data, what do they mean?

MapReduce
- what happens when the master goes down?
- restarting tasks seems wasteful, why do that?
- single master, failure unlikely how?!

What's the problem MapReduce is trying to solve?
- how to process *large* amounts of data quickly
- really, summarize data
- think of this as web crawl data

example problem: what's the frequency of words on the entire web (as crawled by Google)?

What's the strategy?
- map: analyze data in small chunks in parallel,
  i.e., run a program on each bit of data separately
  (example: have a program calculate word frequencies in a web page)
- reduce: combine results to summarize
  - must be an operation that doesn't depend on order of operations
  - get a list, produce a smaller list where every element is of the same type as before
  - example: combine word frequencies for individual pages to get aggregate word frequencies

When fully reduced, you have word frequencies for the entire web (see the sketch below).
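
A minimal, single-machine sketch of that word-count strategy (plain Python,
made-up function names; the real system distributes this across thousands of
machines, but the map/reduce shape is the same):

from collections import Counter
from functools import reduce

def map_page(page_text):
    # map: word frequencies for one page, independent of every other page
    return Counter(page_text.lower().split())

def reduce_counts(a, b):
    # reduce: merge two partial results; adding counts doesn't depend on
    # the order of operations, which is what makes it a valid reduce
    return a + b

pages = ["the web is big", "the crawl sees the web"]
partials = [map_page(p) for p in pages]            # could run in parallel
totals = reduce(reduce_counts, partials, Counter())
print(totals)                                      # e.g. 'the': 3, 'web': 2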

Can every problem be expressed as a map and a reduce?
- no!

Can a lot of problems that people care about be expressed this way?
- yes!

map is embarrassingly parallel
- so can be maximally sped up
- think of this as extracting the features you care about

reduce can be slow, but normally it shouldn't be that hard
- can't have order dependency though
- here we process the features, but in a non-order-dependent way (see the sketch below)
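
The "no order dependency" requirement means the combining step has to be
associative and commutative, so partial results can be merged in whatever
order workers happen to finish. A quick illustration, merging per-page counts
in a shuffled order (assumed toy data):

import random
from collections import Counter

partials = [Counter({"the": 3, "web": 1}),
            Counter({"the": 1, "crawl": 2}),
            Counter({"web": 4})]
random.shuffle(partials)      # arrival order shouldn't matter
total = Counter()
for p in partials:
    total += p                # Counter addition is order-independent
print(total)                  # same totals no matter how the shuffle came out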

Why do pieces "start over" when a node crashes during a MapReduce operation?
- we don't have access to the local state that was computed
  - local storage is way faster for this sort of thing!
- and even if you did, you can't trust that state
  - because it could have been messed up for a while
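
Restarting from scratch is reasonable because a map task is a deterministic
function of its input split: nothing needs to be salvaged from the dead
machine, the master just hands the same split to another worker. A toy version
of that scheduling loop (hypothetical names, not the paper's actual protocol):

class WorkerCrashed(Exception):
    pass

def run_map_phase(splits, run_on_some_worker):
    # run_on_some_worker(split) returns that split's map output, or raises
    # WorkerCrashed if the machine died mid-task (both names are made up here)
    pending = list(splits)
    results = []
    while pending:
        split = pending.pop()
        try:
            results.append(run_on_some_worker(split))
        except WorkerCrashed:
            pending.append(split)   # nothing to recover; rerunning the task
                                    # on the same input gives the same output
    return results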

What about the master crashing?
- if it is a long-running job, the app developer should do their own
  checkpointing (a sketch of the idea follows below)
- if it isn't, we don't care!
- and we can generally scale the machines used (at a place like Google)
  to make individual runs pretty quick
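
On that checkpointing point: the framework won't save a long job for you if
the master dies, so the application itself periodically writes its progress
somewhere durable and resumes from there on a rerun. A rough sketch of the
pattern (file name and state layout are made up):

import json, os

CHECKPOINT = "job.ckpt"   # hypothetical checkpoint file

def load_checkpoint():
    # resume from the last completed chunk if an earlier run died partway
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_chunk": 0, "partial": {}}

def save_checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)   # atomic rename so a crash can't corrupt it

def run(chunks, process_chunk, merge):
    state = load_checkpoint()
    for i in range(state["next_chunk"], len(chunks)):
        state["partial"] = merge(state["partial"], process_chunk(chunks[i]))
        state["next_chunk"] = i + 1
        save_checkpoint(state)    # after a crash, restart here, not from zero
    return state["partial"]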

MapReduce is everywhere now
- sometimes too much: it can be used when the paradigm doesn't really fit the task
- anything that has more data dependencies, i.e., depends on order of execution,
  is a poor fit

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
- the "dwarfs" of parallel computation
- they talk about what can be parallelized and what can't
- most of what we talk about in this class is embarrassingly parallel
  - except for the parts that aren't

BigTable
- "database" that makes use of parallel storage, i.e., GFS
- overall very parallel, but certain parts are not
- what is very centralized?
  - look at Figure 4: most data is in SSTables stored in GFS, but the root
    isn't, it is in Chubby

Chubby gives robust config information
- the state of BigTable at a high level
- has to be very centralized, very available

Note that Chubby *doesn't* scale well, it only scales a bit
- like 5 machines
- but it scales enough to do what it needs to
- it is the part of the state of the system that you can NEVER allow to be inconsistent

Parallel systems are hard because they can get partitioned in various ways and state can become inconsistent
- and the communication needed to maintain consistency grows roughly
  quadratically with size if done naively (every node talking to every node)

Chubby is a fully connected graph amongst its nodes
- all messages are seen by all nodes

But efficient parallel systems can't be fully connected
- way, way too much communication (see the quick numbers below)
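
To put numbers on "way, way too much communication": in a fully connected
group every pair of nodes needs a link (and, Chubby-style, every message goes
to everyone), so the link count grows roughly quadratically. Quick arithmetic:

for n in (5, 100, 10_000):
    links = n * (n - 1) // 2
    print(n, "nodes ->", links, "pairwise links")
# 5 nodes -> 10, 100 nodes -> 4950, 10000 nodes -> 49,995,000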

How does BigTable handle mutable state?
- can you update arbitrary entries? not directly!
- data is immutable
- if you want to change something, you have to write it to a new table and
  eventually consolidate to get rid of old data
- a bit like a log-structured filesystem (see the sketch below)
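
A toy model of that immutable-table idea (nothing like BigTable's real on-disk
format, just the shape of it): updates are writes of new tables, reads prefer
the newest table, and compaction merges everything and finally drops stale values.

class ImmutableStore:
    def __init__(self):
        self.tables = []                     # oldest first; each is write-once

    def write_table(self, rows):
        self.tables.append(dict(rows))       # a new immutable table per batch of writes

    def read(self, key):
        for table in reversed(self.tables):  # newest value wins
            if key in table:
                return table[key]
        return None

    def compact(self):
        merged = {}
        for table in self.tables:            # later tables shadow earlier ones
            merged.update(table)
        self.tables = [merged]               # old values finally disappear here

store = ImmutableStore()
store.write_table({"row1": "v1"})
store.write_table({"row1": "v2"})            # an "update" is just a newer table
print(store.read("row1"))                    # v2
store.compact()                              # consolidation drops the old "v1"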

immutability is a very powerful idea in distributed systems
- if state can't change, you never have to communicate updates

Does this architecture have privacy implications?
- deletes don't immediately delete data
- modern solution: delete the encryption key (see the sketch below)
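
A sketch of that "delete the key" idea (crypto-erasure). The third-party
cryptography package's Fernet is just an assumed stand-in here; the point is
that the immutable ciphertext can sit in storage forever, but once its key is
gone it is unreadable.

from cryptography.fernet import Fernet

keys = {}                                    # per-user keys, kept separately

def store_user_data(user, plaintext):
    keys[user] = Fernet.generate_key()
    return Fernet(keys[user]).encrypt(plaintext)   # ciphertext goes to bulk storage

def delete_user(user):
    del keys[user]   # the ciphertext is never touched, but is now garbage

blob = store_user_data("alice", b"private stuff")
delete_user("alice")
# blob still exists on disk, but without the key it can't be decrypted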
</pre>