DistOS 2021F 2021-10-14

==Notes==


<pre>
Lecture 10
----------
BigTable & MapReduce


Questions


BigTable
- single row transactions
- sparse semi-structured data, what do they mean?


MapReduce
- what happens when the master goes down?
- restarting tasks seems wasteful, why do that?
- single master, failure unlikely how?!


What's the problem MapReduce is trying to solve?
- how to process *large* amounts of data quickly
  - really, summarize data
- think of this as web crawl data


example problem: what's the frequency of words on the entire web (as crawled by Google)?


What's the strategy?
- map: analyze data in small chunks in parallel,
        i.e., run a program on each bit of data separately


        (example: have a program calculate word frequencies in
        a web page)
- reduce: combine results to summarize
    - must be an operation that doesn't depend on order
      of operations
    - get a list, produce a smaller list where every element is
      of the same type as before
    - example: combine word frequencies for individual pages
      to get aggregate word frequencies


When fully reduced, you have word frequencies for the entire web.
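
A minimal sketch of that word-frequency example in plain Python -- just the
map/shuffle/reduce structure, not Google's actual MapReduce API:

    from collections import defaultdict

    def map_phase(page_text):
        # map: emit (word, 1) for every word on one "page"
        for word in page_text.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        # reduce: order-independent, since addition is associative/commutative
        return word, sum(counts)

    def mapreduce(pages):
        grouped = defaultdict(list)
        for page in pages:                      # each map task is independent
            for word, count in map_phase(page):
                grouped[word].append(count)     # "shuffle": group by key
        return dict(reduce_phase(w, c) for w, c in grouped.items())

    print(mapreduce(["the cat sat", "the dog sat"]))
    # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}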


Can every problem be expressed as a map and a reduce?
- no!


Can a lot of problems that people care about be expressed this way?
- yes!


map is embarrassingly parallel
  - so can be maximally sped up
  - think of this as extracting the features you care about


reduce can be slow, but normally it shouldn't be that hard
- can't have order dependency though
- here we process the features, but in a non-order-dependent way


Why do pieces "start over" when a node crashes during a mapreduce operation?
- we don't have access to the local state that was computed
    - local storage is way faster for this sort of thing!
- and even if you did, you can't trust that state
    - because it could have been messed up for a while


What about master crashing?
- if it is a long running job, the app developer should
  do their own checkpointing
- if it isn't, we don't care!
- and we can generally scale the machines used
  (at a place like google) to make individual runs pretty quick
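
A hedged sketch of that kind of application-level checkpointing (file name and
work loop are made up for illustration):

    import os, pickle

    CHECKPOINT = "job.ckpt"                    # hypothetical checkpoint file

    def load_state():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)          # resume where we left off
        return {"step": 0, "partial": {}}      # fresh start

    def save_state(state):
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)            # atomic rename: no torn checkpoints

    state = load_state()
    for step in range(state["step"], 1000):
        state["partial"][step] = step * step   # stand-in for the real work
        state["step"] = step + 1
        if step % 100 == 0:
            save_state(state)                  # a crash loses at most ~100 steps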


MapReduce is everywhere now
- sometimes too much, it can be used when the paradigm
  doesn't really fit the task
    - anything that has more data dependencies, i.e. depends on
      order of execution
   
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
- "dwarves" of parallel computation
- they talk about what can be parallelized and what can't
- most of what we talk about in this class is embarrassingly parallel
    - except for the parts that aren't


BigTable
- "database" that makes use of parallel storage, i.e., GFS
- overall very parallel, but certain parts are not
    - what is very centralized?
    - look at Figure 4, most data is in SSTables stored in GFS, but the root isn't, it is in Chubby


Chubby gives robust config information
- the state of BigTable at a high level
- has to be very centralized, very available


Note that Chubby *doesn't* scale well, it only scales a bit
- like 5 machines
- but it scales enough to do what it needs to
- it is the part of the state of the system that
  you can NEVER allow to be inconsistent


Parallel systems are hard because they can get partitioned in various ways and state can become inconsistent
- and the communication needed to maintain consistency grows rapidly with size
  if done naively (pairwise links alone grow quadratically)


Chubby is a fully connected graph amongst its nodes
- all messages are seen by all nodes


But efficient parallel systems can't be fully connected
- way, way too much communication
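
Rough arithmetic behind "too much communication": pairwise links in a fully
connected group grow quadratically with the number of nodes.

    for n in (5, 50, 500, 5000):
        links = n * (n - 1) // 2               # every node talks to every other
        print(n, "nodes ->", links, "links")   # 10, 1225, 124750, 12497500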


How does BigTable handle mutable state?
- can you update arbitrary entries?  not directly!
    - data is immutable
- if you want to change something, you have to write it to a
  new table and eventually consolidate to get rid of old data
    - a bit like a log-structured filesystem


immutability is a very powerful idea in distributed systems
- if state can't change, you never have to communicate updates
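
A toy sketch of the "write new immutable tables, consolidate later" idea --
nothing BigTable-specific, just the log-structured pattern described above:

    class ImmutableStore:
        def __init__(self):
            self.tables = []                   # newest table last; none is edited

        def write_batch(self, rows):
            self.tables.append(dict(rows))     # an "update" is just a new table

        def read(self, key):
            for table in reversed(self.tables):  # newest value wins
                if key in table:
                    return table[key]
            return None

        def compact(self):
            merged = {}
            for table in self.tables:          # older first, newer overwrite
                merged.update(table)
            self.tables = [merged]             # old tables can now be discarded

    s = ImmutableStore()
    s.write_batch({"row1": "a"})
    s.write_batch({"row1": "b", "row2": "c"})  # shadows the old value of row1
    print(s.read("row1"))                      # b
    s.compact()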
 
Does this architecture have privacy implications?
- deletes don't immediately delete data
  - modern solution: delete encryption key
</pre>


Big Picture (What the heck is this stuff?)

  • Think back to the beginning of the course:
    • Things were really mostly about taking ideas from Unix and making them distributed
    • Folks did this at various levels of abstraction
    • Some used the Unix kernel, some went off and did their own thing
  • Now what are Borg, Omega and K8s doing?
    • We’re no longer making Unix distributed – instead, we are distributing Unix
    • The unit of computation is now a stripped-down Linux userspace -> containers!

What’s a container?

  • It’s just a collection of related processes and their resources
    • Often only one process
  • The unifying abstractions in the Linux world are (a rough sketch of both follows this list):
    1. Namespaces
    2. Control groups (cgroups) – These were actually invented at Google for Borg/Omega/K8s
  • Namespaces provide private resource “mappings”
    • So in my PID namespace my PID might be 1, while it might be 12345 in your PID namespace
    • Your PID might not even exist in my PID namespace
  • Control groups are used for:
    • Resource accounting
    • Resource management
    • Think “quantifiable” resources, like CPU, I/O, memory
    • These are often also used to canonically “group” processes together
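
A rough sketch of both mechanisms, assuming a Linux host with cgroup v2 mounted at /sys/fs/cgroup and the util-linux unshare tool, run as root; the group name "demo" and the commands run inside it are made up for illustration:

    import os
    import subprocess

    # cgroups: account for and cap "quantifiable" resources (cgroup v2 layout)
    cg = "/sys/fs/cgroup/demo"                 # hypothetical group name
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "memory.max"), "w") as f:
        f.write("268435456")                   # cap the group at 256 MiB

    # namespaces: give the process a private view of PIDs, hostname, /proc
    child = subprocess.Popen(
        ["unshare", "--fork", "--pid", "--mount-proc", "--uts",
         "sh", "-c", "hostname container-demo; echo pid in my namespace: $$; sleep 30"])

    # put the child into the cgroup so its memory use is accounted and limited
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(child.pid))

    child.wait()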

What’s so special about a container here?

  • You get some really nice emergent properties from containers in a distributed context
  • The OS-dependent interface is small (system calls and some system-wide parameters)
  • Limited interference (when well-behaved)
  • Do Linux containers provide real security benefits on their own? NO. (At least don't rely on them to isolate anything important on their own)

Abstraction is one of the most important themes in OS design

  • Recall the goal of an operating system: We want to turn the computer into something you want to program on
  • Containers abstract away the details of the underlying system
    • This is super nice for deploying applications
  • Pods abstract away the details of orchestrating the containers
    • This is super nice for distributing applications
  • Multiple layers of abstraction emerge, beyond the underlying OS kernel (which itself is full of them!)

Borg (https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43438.pdf)

  • What basic problem does Borg try to solve? Google basically had the following end goals in mind:
    • We want to distribute our applications
    • We want them to use resources efficiently
    • We don’t want long-running services to go down, become overwhelmed, or become resource-starved
    • We want our applications to be hardware-agnostic and OS-agnostic (this means we need abstractions!)
  • Kinds of workloads
    • Both batch jobs and long-running services
    • Prod vs non-prod (latency-sensitive vs not) -> most batch jobs are non-prod while most services are prod
    • Long-running services should be prioritized when scheduling – particularly if they are user-facing
  • Workloads can be disparate, but run on shared resources
    • This improves resource utilization
    • Why is that a good thing?
    • Time is money! More efficient utilization of computational resources reduces overhead cost
  • Evolving internal tooling
    • configuring + updating jobs
    • predicting resource requirements
    • pushing configuration
    • service discovery + load balancing
  • Tooling wasn’t really unified
    • Most of it was developed ad-hoc as needed
    • This is great, but the result is that we essentially lack any kind of consistency
    • Lack of consistency in turn increases the barrier to entry for deploying new services on Borg
    • Probably okay for internal use, but not really well-suited to end users
  • Borg architecture:
    • Compute clusters are broken into cells
    • Each cell can be made up of thousands of nodes
    • One Borgmaster per cell
    • Many Borglets per cell
    • N.B. Google typically only deploys one main cell on a given cluster, possibly with a few smaller cells for specific tasks
  • Borgmaster election
    • The Borgmaster abstraction presents as a single process
    • But it’s actually replicated under the hood
    • We choose one node as the elected master
    • Paxos allows us to establish consensus, Chubby then advertises who won the election (a rough sketch of this election pattern follows this list)
  • Recovery from failure
    • Borgmaster checkpoints -> a snapshot + a change log for reconstruction
    • Checkpoints live in the Paxos datastore
    • Fauxmaster can be used to simulate a real system with checkpoints
    • Good for user intervention and capacity planning
  • Container orchestration
    • IPs are per-host
    • Every container gets its own port number (any problems?)
    • Workload is broken up into jobs which have tasks (just a domain-specific word for containers)
    • Jobs and tasks are irrevocably linked together (one cannot exist without the other)
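
The papers use Paxos for consensus and Chubby to advertise the winner; the sketch below is only a stand-in for that "one elected master, held through a lock service" pattern, using a local file lock in place of Chubby (all names hypothetical):

    import fcntl
    import time

    LOCK_PATH = "/tmp/borgmaster.lock"         # stand-in for a Chubby lock file

    def try_to_become_master(replica_name):
        f = open(LOCK_PATH, "w")
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # only one replica wins
        except OSError:
            f.close()
            return None                        # someone else is already master
        f.write(replica_name)                  # "advertise" who won
        f.flush()
        return f                               # keep the handle open to hold the lock

    lock = None
    while lock is None:
        lock = try_to_become_master("replica-3")
        if lock is None:
            time.sleep(1)                      # standby replicas wait and retry
    print("acting as elected Borgmaster")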

Omega (https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41684.pdf)

  • Built to be a better Borg scheduler
    • Architectural improvements
    • What were the problems with the original Borg scheduler?
  • Storing internal state
    • Transaction-oriented data store
    • Paxos for consensus, fault-tolerance
    • So far this is pretty similar to Borg, but….
    • Optimistic concurrency control to handle conflicts (sketched after this list)
    • Different parts of the cluster control plane access the data store
    • The store is the only centralized component!
  • How is their approach to internal state better than Borg?
    • It enables them to break up the Borgmaster
    • No longer a single point of failure
    • Easier to scale and modify the control plane as needed
  • Monolithic vs Two-Level vs Shared-State schedulers
    • monolithic schedulers (generally) have no concurrency
    • they use a single scheduling algorithm (it usually has to be both simple and fair)
    • two-level schedulers have a component that passes off the resources to other schedulers
    • two-level -> we are “scheduling” our schedulers
    • shared state schedulers operate on readily available copies of resources
    • some important properties in Omega’s design:
      1. lock-free
      2. optimistic concurrency control
      3. transaction-based
  • What does “lock-free” mean?
    • A lock-free algorithm guarantees that the system as a whole always makes progress: at any given time, at least one thread can complete its operation – in other words, guaranteed system-wide progress
    • A wait-free algorithm has stronger guarantees: no computation will ever be blocked by another computation – in other words, guaranteed per-thread progress
    • As you can imagine, wait-free algorithms are very hard to come up with (but the trade-off we get for higher complexity is even better concurrency guarantees)
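
A rough sketch of the optimistic, retry-based pattern the shared-state design leans on: read a versioned copy, compute a scheduling decision, and commit only if the version is unchanged. This is generic optimistic concurrency control, not Omega's actual store; all names are illustrative.

    import threading

    class VersionedCell:
        """One cell's free CPUs, with a version number for optimistic commits."""
        def __init__(self, value):
            self._value, self._version = value, 0
            self._mu = threading.Lock()        # stands in for the store's atomic commit

        def read(self):
            with self._mu:
                return self._value, self._version   # schedulers work on this copy

        def commit(self, new_value, expected_version):
            with self._mu:
                if self._version != expected_version:
                    return False               # conflict: someone committed first
                self._value, self._version = new_value, self._version + 1
                return True

    def schedule(cell, cpus_needed):
        while True:                            # optimistic retry loop, no long-held locks
            free, version = cell.read()
            if free < cpus_needed:
                return False                   # not enough resources
            if cell.commit(free - cpus_needed, version):
                return True                    # our "transaction" was applied

    cell = VersionedCell(16)
    print(schedule(cell, 4), cell.read())      # True (12, 1)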

How Does K8s Fit in Here? (https://queue.acm.org/detail.cfm?id=2898444)

  • K8s is an Omega designed for end-users
    • Google wanted to sell more compute time on their cloud platform
    • So they made the cloud more attractive to program on
  • Pods
    • A higher level abstraction on top of deployment containers
    • Pods are containers that encapsulate one or more inner containers
    • Pod defines the available resources, inner containers provide the isolation
    • (Borg had a similar concept called the “alloc” but this was optional)
  • Borg relied on external tooling to handle many of its resource and cluster management needs
  • On the other hand, K8s tries to make its API consistent, rather than ad-hoc
    • Exposes a set of API objects, each with its own purpose and accompanying metadata
    • API objects have a base object metadata, spec, and status
    • Spec varies by object type and describes its desired state
    • Status describes its current state
    • Object metadata then defines a consistent way of labelling objects
  • Replication and auto scaling
    • K8s decouples these concepts
    • Replica controller ensures that a given number of pods of a given type are always running
    • The auto scaler then just adjusts the number of desired pods, and the replica controller sees this change
  • Separation of concerns between K8s components
    • This is the Unix philosophy!
    • Composable components that do one thing and do it well
  • Different flavours of API object for different purposes, e.g.
    • ReplicationController -> runs replicated containers that are meant to keep running indefinitely
    • DaemonSet -> a single pod instance on every node
    • Job -> a controller that can run an arbitrarily-scaled batch job from start to finish
  • Reconciliation controllers
    • Continually observe state in a loop
    • Where discrepancies between current and desired state are detected, take action (a minimal reconcile loop is sketched after this list)
    • The actual controller is stateless
    • This makes it fault-tolerant! If a controller fails, just let someone else pick up the slack
  • All store accesses are forced through a centralized API server
    • This helps to provide consistency in access
    • Centralization makes it easy to enforce consistent semantics and provide abstractions
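
A minimal sketch of the reconciliation idea: the controller keeps no state of its own, it just reads desired vs current from a stand-in for the API server's store and closes the gap one step at a time. Field names like desired_replicas are illustrative, not the real K8s API.

    import time

    # stand-in for the API server / datastore: spec (desired) vs status (current)
    store = {
        "spec":   {"desired_replicas": 3},
        "status": {"running_pods": ["pod-a"]},
    }

    def start_pod(i):
        store["status"]["running_pods"].append(f"pod-{i}")  # pretend to launch one

    def stop_pod():
        store["status"]["running_pods"].pop()               # pretend to kill one

    def reconcile_once():
        desired = store["spec"]["desired_replicas"]
        current = len(store["status"]["running_pods"])
        if current < desired:
            start_pod(current)                 # one corrective action per pass
        elif current > desired:
            stop_pod()

    for _ in range(5):                         # the loop is stateless between passes
        reconcile_once()
        time.sleep(0.1)

    print(store["status"]["running_pods"])     # converges to 3 running pods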

Discussion Questions

What is so special about using the container as a unit of computation? Did the papers discuss any particular motivation for this design choice? How does this approach differ from earlier papers we have looked at in this course?

How do the containers described here compare with Solaris Zones? How do both approaches compare with vanilla Unix?

How does Borg compare with Omega in terms of its design and architecture? How do both systems compare with Kubernetes? Which of these systems seems the most user-friendly? Are any of them better-equipped to fill some obvious niche?

Why do you think Google still mostly relied on Borg internally (at the time of these papers), despite admitted shortcomings?

What motivated the design of three different container orchestration frameworks? Why not just modify Borg as needed?

Did any tricky questions come up when reading the papers? Discuss these with your group and try to come up with possible answers to these questions.

Did you have any observations about a particular design choice (good or bad)? Try to reach an agreement as a group on a particular design choice. If you can’t reach an agreement, talk about why your opinions differ.

K8s Demo

  • Cluster setup, OpenStack setup, getting ready for experience 1
  • Demonstrate some specific parts of experience 1, without giving anything away
  • Any specific requests?