DistOS 2021F 2021-10-12

Notes

Lecture 10
----------
BigTable & MapReduce

Questions

BigTable
 - single row transactions
 - sparse semi-structured data, what do they mean?

MapReduce
 - what happens when the master goes down?
 - restarting tasks seems wasteful, why do that?
 - single master, failure unlikely how?!

What's the problem MapReduce is trying to solve?
 - how to process *large* amounts of data quickly
   - really, summarize data
 - think of this as web crawl data

example problem: what's the frequency of words on the entire web (as crawled by Google)?

What's the strategy?
 - map: analyze data in small chunks in parallel,
        i.e., run a program on each bit of data separately

        (example: have a program calculate word frequencies in
         a web page)
 - reduce: combine results to summarize
    - must be an operation that doesn't depend on order
      of operations
    - get a list, produce a smaller list where every element is
      of the same type as before
    - example: combine word frequencies for individual pages
      to get aggregate word frequencies

When fully reduced, you have word frequencies for the entire web.
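
A toy Python sketch of the two functions for this word-count example (not from the lecture; the names are made up, and a real MapReduce run distributes these across thousands of workers):

from collections import Counter

def map_page(url, page_text):
    # "map" step: runs independently on each page; url is the input key
    # (unused here), page_text is the value; emits (word, count) pairs
    for word, count in Counter(page_text.lower().split()).items():
        yield word, count

def reduce_word(word, counts):
    # "reduce" step: combine every count emitted for one word; summing is
    # associative and commutative, so the order counts arrive in doesn't matter
    return word, sum(counts)

# per-page map output for one made-up page:
print(dict(map_page("http://example.com", "the web the web is big")))
# -> {'the': 2, 'web': 2, 'is': 1, 'big': 1}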

Can every problem be expressed as a map and a reduce?
 - no!

Can a lot of problems that people care about be expressed this way?
 - yes!

map is embarrassingly parallel
  - so can be maximally sped up
  - think of this as extracting the features you care about

reduce can be slow, but normally it shouldn't be that hard
 - can't have order dependency though
 - here we process the features, but in a non-order-dependent way
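
A small illustration of that order-independence (made-up partial counts): because merging counts is associative and commutative, partial results from different workers can be combined in whatever order they arrive.

import random
from collections import Counter
from functools import reduce

# partial word counts produced by three different workers (made-up data)
partials = [Counter(a=2, b=1), Counter(b=3), Counter(a=1, c=4)]

random.shuffle(partials)   # workers can finish in any order
total = reduce(lambda x, y: x + y, partials, Counter())
assert total == Counter(a=3, b=4, c=4)   # same aggregate regardless of order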

Why do pieces "start over" when a node crashes during a MapReduce operation?
 - we don't have access to the local state that was computed
    - local storage is way faster for this sort of thing!
 - and even if you did, you can't trust that state
    - because it could have been messed up for a while

What about master crashing?
 - if it is a long running job, the app developer should
   do their own checkpointing
 - if it isn't, we don't care!
 - and we can generally scale the machines used
   (at a place like google) to make individual runs pretty quick
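
A minimal sketch of what application-level checkpointing could look like (entirely an assumption, not Google's code; the file name and state format are made up):

import json, os

CHECKPOINT = "job.ckpt"   # hypothetical checkpoint file

def load_checkpoint():
    # on restart, resume from the last saved position instead of starting over
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_chunk": 0, "partials": {}}

def save_checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)   # atomic rename: never a half-written checkpoint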

MapReduce is everywhere now
 - sometimes too much, it can be used when the paradigm
   doesn't really fit the task
    - anything that has more data dependencies, i.e. depends on
      order of execution
    
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
 - "dwarves" of parallel computation
 - they talk about what can be parallelized and what can't
 - most of what we talk about in this class is embarrassingly parallel
    - except for the parts that aren't

BigTable
 - "database" that makes use of parallel storage, i.e., GFS
 - overall very parallel, but certain parts are not
    - what is very centralized?
    - look at Figure 4: most data is in SSTables stored in GFS, but the location of the root tablet is kept in Chubby
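
A toy model of that lookup path (data and names made up; in the real system each step below is an RPC to a tablet server reading an SSTable in GFS, and only the first, tiny pointer lives in Chubby):

# three-level tablet location hierarchy, modeled as plain dicts
chubby = {"/bigtable/root-location": "root-tablet"}
tablets = {
    "root-tablet":       {("users", "row17"): "metadata-tablet-3"},  # root METADATA tablet
    "metadata-tablet-3": {("users", "row17"): "user-tablet-42"},     # second-level METADATA
}

def locate_tablet(table, row_key):
    root = chubby["/bigtable/root-location"]    # level 1: tiny pointer in Chubby
    meta = tablets[root][(table, row_key)]      # level 2: root METADATA tablet
    return tablets[meta][(table, row_key)]      # level 3: METADATA -> user tablet

print(locate_tablet("users", "row17"))   # -> user-tablet-42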

Chubby gives robust config information
 - the state of BigTable at a high level
 - has to be very centralized, very available

Note that Chubby *doesn't* scale well, it only scales a bit
 - like 5 machines
 - but it scales enough to do what it needs to
 - it is the part of the state of the system that
   you can NEVER allow to be inconsistent

Parallel systems are hard because they can get partitioned in various ways and state can become inconsistent
 - and communication to maintain consistency goes up exponentially with size if done naively

Chubby is a fully connected graph amongst its nodes
 - all messages are seen by all nodes

But efficient parallel systems can't be fully connected
 - way, way too much communication

How does BigTable handle mutable state?
 - can you update arbitrary entries?  not directly!
    - data is immutable
 - if you want to change something, you have to write it to a
   new table and eventually consolidate to get rid of old data
    - a bit like a log-structured filesystem
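
A toy sketch of that idea (not BigTable's actual design details): an "update" is just a newer immutable table, reads prefer the newest table, and consolidation merges everything so stale versions can finally be dropped.

class ImmutableStore:
    """Toy model: each 'table' is a frozen batch of rows; old ones are never modified."""
    def __init__(self):
        self.tables = []                    # oldest first

    def write_table(self, rows):
        self.tables.append(dict(rows))      # every batch of writes is a new immutable table

    def read(self, key):
        # newest table wins, like checking the newest SSTable first
        for table in reversed(self.tables):
            if key in table:
                return table[key]
        return None

    def consolidate(self):
        # "major compaction": merge into one table, dropping superseded versions
        merged = {}
        for table in self.tables:           # oldest first, so newer values overwrite
            merged.update(table)
        self.tables = [merged]

store = ImmutableStore()
store.write_table({"row1": "v1", "row2": "a"})
store.write_table({"row1": "v2"})            # the "update" is just a newer table
assert store.read("row1") == "v2"
store.consolidate()                          # old data for row1 is gone now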

immutability is a very powerful idea in distributed systems
 - if state can't change, you never have to communicate updates

Does this architecture have privacy implications?
 - deletes don't immediately delete data
   - modern solution: delete encryption key
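
One way the "delete the key" idea can be sketched (hypothetical record names; assumes the third-party cryptography package, not any particular system's implementation): give each record its own key, so deleting the key makes the ciphertext sitting in old immutable tables unrecoverable.

from cryptography.fernet import Fernet   # assumes the 'cryptography' package

records = {}   # record_id -> ciphertext (may live "forever" in immutable storage)
keys = {}      # record_id -> key (the only thing we ever actually delete)

def store(record_id, data: bytes):
    key = Fernet.generate_key()
    keys[record_id] = key
    records[record_id] = Fernet(key).encrypt(data)

def read(record_id) -> bytes:
    return Fernet(keys[record_id]).decrypt(records[record_id])

def delete(record_id):
    # "crypto-shredding": the ciphertext may persist in old tables and backups,
    # but without the key it is unrecoverable
    del keys[record_id]

store("user42", b"personal data")
assert read("user42") == b"personal data"
delete("user42")
# records["user42"] still exists, but can no longer be decrypted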