DistOS 2021F 2021-10-14

==Notes==
<pre>
Lecture 10
----------
BigTable & MapReduce
Questions
BigTable
- single row transactions
- sparse semi-structured data, what do they mean?
MapReduce
- what happens when the master goes down?
- restarting tasks seems wasteful, why do that?
- single master, failure unlikely how?!
What's the problem MapReduce is trying to solve?
- how to process *large* amounts of data quickly
  - really, summarize data
- think of this as web crawl data
example problem: what's the frequency of words on the entire web (as crawled by google?)
What's the strategy?
- map: analyze data in small chunks in parallel,
        i.e., run a program on each bit of data separately
        (example: have a program calculate word frequencies
        in a web page)
- reduce: combine results to summarize
    - must be an operation that doesn't depend on order
      of operations
    - get a list, produce a smaller list where every element is
      of the same type as before
    - example: combine word frequencies for individual pages
      to get aggregate word frequencies
When fully reduced, you have word frequencies for the entire web.
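</pre>

The map/reduce word-frequency example above can be sketched in a few lines of Python (a toy sketch, not Google's implementation; the sample pages are made up):

```python
from collections import Counter
from functools import reduce

def map_page(text):
    # map step: compute word frequencies for one "page" independently;
    # this is the embarrassingly parallel part
    return Counter(text.lower().split())

def reduce_counts(a, b):
    # reduce step: merge two frequency tables; Counter addition is
    # commutative and associative, so the merge order doesn't matter
    return a + b

pages = ["the cat sat", "the dog sat", "cat and dog"]
total = reduce(reduce_counts, map(map_page, pages), Counter())
```

Because `reduce_counts` is order-independent, the merges can happen in any grouping across many machines and still produce the same aggregate table.

<pre>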
Can every problem be expressed as a map and a reduce?
- no!
Can a lot of problems that people care about be expressed this way?
- yes!
map is embarrassingly parallel
  - so can be maximally sped up
  - think of this as extracting the features you care about
reduce can be slow, but normally it shouldn't be that hard
- can't have order dependency though
- here we process the features, but in a non-order-dependent way
Why do pieces "start over" when a node crashes during a mapreduce operation?
- we don't have access to the local state that was computed
    - local storage is way faster for this sort of thing!
- and even if you did, you can't trust that state
    - because it could have been messed up for a while
What about master crashing?
- if it is a long running job, the app developer should
  do their own checkpointing
- if it isn't, we don't care!
- and we can generally scale the machines used
  (at a place like google) to make individual runs pretty quick
MapReduce is everywhere now
- sometimes too much, it can be used when the paradigm
  doesn't really fit the task
    - anything that has more data dependencies, i.e. depends on
      order of execution
   
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
- "dwarves" of parallel computation
- they talk about what can be parallelized and what can't
- most of what we talk about in this class is embarrassingly parallel
    - except for the parts that aren't
BigTable
- "database" that makes use of parallel storage, i.e., GFS
- overall very parallel, but certain parts are not
    - what is very centralized?
    - look at Figure 4, most data is in SSTables stored in GFS, but the root isn't, it is in Chubby
Chubby gives robust config information
- the state of BigTable at a high level
- has to be very centralized, very available
Note that Chubby *doesn't* scale well, it only scales a bit
- like 5 machines
- but it scales enough to do what it needs to
- it is the part of the state of the system that
  you can NEVER allow to be inconsistent
Parallel systems are hard because they can get partitioned in various ways and state can become inconsistent
- and the communication needed to maintain consistency grows rapidly with size if done naively (quadratically just to keep a fully connected group in sync, worse for naive consensus protocols)
Chubby is a fully connected graph amongst its nodes
- all messages are seen by all nodes
But efficient parallel systems can't be fully connected
- way, way too much communication
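</pre>

A quick calculation shows why full connectivity only works at Chubby's scale (around 5 nodes) and not at cluster scale:

```python
def channels(n):
    # number of point-to-point links in a fully connected group of n nodes:
    # every node talks to every other node, so n choose 2 links
    return n * (n - 1) // 2

# 5 Chubby-sized nodes need only 10 links; a 1000-node cluster would
# need nearly half a million
small = channels(5)
large = channels(1000)
```

The link count grows quadratically, so "all messages seen by all nodes" is affordable for a tiny, highly consistent core but hopeless for the system as a whole.

<pre>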
How does BigTable handle mutable state?
- can you update arbitrary entries?  not directly!
    - data is immutable
- if you want to change something, you have to write it to a
  new table and eventually consolidate to get rid of old data
    - a bit like a log-structured filesystem
immutability is a very powerful idea in distributed systems
- if state can't change, you never have to communicate updates
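</pre>

The write-new-then-consolidate pattern above can be sketched with a toy class (the name `TinyTable` and its methods are hypothetical, loosely mimicking how newer SSTables shadow older ones):

```python
class TinyTable:
    def __init__(self):
        self.tables = []   # each entry is an immutable snapshot, newest last

    def write_batch(self, updates):
        # writes never modify old data; they become a new immutable table
        self.tables.append(dict(updates))

    def read(self, key):
        # newest table wins: later writes shadow earlier ones
        for t in reversed(self.tables):
            if key in t:
                return t[key]
        return None

    def compact(self):
        # consolidation: merge all tables into one, discarding shadowed
        # old values (analogous to log-structured filesystem cleaning)
        merged = {}
        for t in self.tables:
            merged.update(t)
        self.tables = [merged]
```

Reads stay correct whether or not compaction has run; compaction only reclaims space occupied by shadowed data.

<pre>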
Does this architecture have privacy implications?
- deletes don't immediately delete data
  - modern solution: delete encryption key
</pre>
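The "delete the encryption key" idea (often called crypto-shredding) can be illustrated with a toy sketch. The XOR "cipher" here is for illustration only, not real cryptography, and the function names are hypothetical:

```python
import secrets

def xor_bytes(data, key):
    # toy stream cipher: XOR with a repeating random key
    # (illustration only -- a real system would use authenticated encryption)
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

keys = {}  # key store: small, mutable, and separate from bulk storage

def store(record_id, plaintext):
    key = secrets.token_bytes(32)
    keys[record_id] = key
    # only ciphertext ever reaches the immutable bulk storage
    return xor_bytes(plaintext, key)

def crypto_delete(record_id):
    # "delete" = forget the key; ciphertext lingering in old immutable
    # tables becomes unreadable without rewriting any of them
    del keys[record_id]
```

This fits the immutable-storage design: the bulk data never has to be rewritten to honor a delete, only the small key store changes.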

Revision as of 02:24, 13 October 2021