DistOS 2021F 2021-10-14

Notes

Big Picture (What the heck is this stuff?)

  • Think back to the beginning of the course:
    • Things were really mostly about taking ideas from Unix and making them distributed
    • Folks did this at various levels of abstraction
    • Some used the Unix kernel, some went off and did their own thing
  • Now what are Borg, Omega and K8s doing?
    • We’re no longer making Unix distributed – instead, we are distributing Unix
    • The unit of computation is now a stripped-down Linux userspace -> containers!

What’s a container?

  • It’s just a collection of related processes and their resources
    • Often only one process
  • The unifying abstractions in the Linux world are (a minimal code sketch follows this list):
    1. Namespaces
    2. Control groups (cgroups) – These were actually invented at Google for Borg/Omega/K8s
  • Namespaces provide private resource “mappings”
    • So in my PID namespace my PID might be 1, while it might be 12345 in your PID namespace
    • Your PID might not even exist in my PID namespace
  • Control groups are used for:
    • Resource accounting
    • Resource management
    • Think “quantifiable” resources, like CPU, I/O, memory
    • These are often also used to canonically “group” processes together
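
To make the two abstractions above concrete, here is a minimal Go sketch (assuming Linux, root privileges, and a cgroup v2 hierarchy mounted at /sys/fs/cgroup) that starts a shell in fresh PID, mount, and UTS namespaces and then caps its CPU through a hypothetical "demo" cgroup. It illustrates the mechanism only; it is not how a real container runtime is implemented.

  // Sketch: namespaces give the child a private view of PIDs, mounts, and
  // hostname; a cgroup caps how much CPU it (and its descendants) may use.
  package main

  import (
      "fmt"
      "os"
      "os/exec"
      "path/filepath"
      "syscall"
  )

  func main() {
      cmd := exec.Command("/bin/sh")
      cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

      // Namespaces: inside these, the shell sees itself as PID 1 and has its
      // own mount table and hostname.
      cmd.SysProcAttr = &syscall.SysProcAttr{
          Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWUTS,
      }
      if err := cmd.Start(); err != nil {
          panic(err)
      }

      // Control group: a quantifiable resource limit (here, half a CPU) that
      // covers the shell and anything it forks. Error handling is omitted.
      cg := "/sys/fs/cgroup/demo" // hypothetical cgroup name
      os.Mkdir(cg, 0755)
      os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("50000 100000"), 0644)
      os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(fmt.Sprint(cmd.Process.Pid)), 0644)

      cmd.Wait()
  }

Real container runtimes layer user namespaces, capability drops, seccomp filters, and a changed root filesystem on top of this, which is part of why bare namespaces plus cgroups should not be treated as a security boundary by themselves.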

What’s so special about a container here?

  • You get some really nice emergent properties from containers in a distributed context
  • The OS-dependent interface is small (system calls and some system-wide parameters)
  • Limited interference (when well-behaved)
  • Do Linux containers provide real security benefits on their own? NO. (At the very least, don't rely on them alone to isolate anything important.)

Abstraction is one of the most important themes in OS design

  • Recall the goal of an operating system: We want to turn the computer into something you want to program on
  • Containers abstract away the details of the underlying system
    • This is super nice for deploying applications
  • Pods abstract away the details of orchestrating the containers
    • This is super nice for distributing applications
  • Multiple layers of abstraction emerge, beyond the underlying OS kernel (which itself is full of them!)

Borg

  • What basic problem does Borg try to solve? Google basically had the following end goals in mind:
    • We want to distribute our applications
    • We want them to use resources efficiently
    • We don’t want long-running services to go down, become overwhelmed, or become resource-starved
    • We want our applications to be hardware-agnostic and OS-agnostic (this means we need abstractions!)
  • Kinds of workloads
    • Both batch jobs and long-running services
    • Prod vs non-prod (latency-sensitive vs not) -> most batch jobs are non-prod while most services are prod
    • Long-running services should be prioritized when scheduling – particularly if they are user-facing
  • Workloads can be disparate, but run on shared resources
    • This improves resource utilization
    • Why is that a good thing?
    • Time is money! More efficient utilization of computational resources reduces overhead cost
  • Evolving internal tooling
    • configuring + updating jobs
    • predicting resource requirements
    • pushing configuration
    • service discovery + load balancing
  • Tooling wasn’t really unified
    • Most of it was developed ad-hoc as needed
    • Ad-hoc development gets the job done, but the result is that we essentially lack any kind of consistency
    • Lack of consistency in turn increases the barrier to entry for deploying new services on Borg
    • Probably okay for internal use, but not really well-suited to end users
  • Borg architecture:
    • Compute clusters are broken into cells
    • Each cell can be made up of thousands of nodes
    • One Borgmaster per cell
    • Many Borglets per cell
    • N.B. Google typically only deploys one main cell on a given cluster, possibly with a few smaller cells for specific tasks
  • Borgmaster election
    • The Borgmaster abstraction presents as a single process
    • But it’s actually replicated under the hood
    • We choose one node as the elected master
    • Paxos is used to establish consensus; Chubby then advertises who won the election
  • Recovery from failure
    • Borgmaster checkpoints -> a snapshot + a change log for reconstruction (a small replay sketch follows this list)
    • Checkpoints live in the Paxos datastore
    • Fauxmaster can be used to simulate a real system with checkpoints
    • Good for user intervention and capacity planning
  • Container orchestration
    • IPs are per-host
    • Every container gets its own port number (any problems?)
    • The workload is broken up into jobs, which are made up of tasks ("task" is essentially Borg's domain-specific word for a container)
    • Jobs and tasks are irrevocably linked together (one cannot exist without the other)
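
As a rough illustration of the checkpoint idea above, here is a small Go sketch that rebuilds cell state from the last snapshot plus a replayed change log. All types and values are hypothetical stand-ins; Borg's real state lives in its Paxos-backed store, and Fauxmaster drives an actual Borgmaster binary against such checkpoints.

  // Sketch: recovery = snapshot + change-log replay.
  package main

  import "fmt"

  // Change is one logged mutation to cell state, e.g. "task 2 moved to n07".
  type Change struct {
      TaskID int
      Node   string
  }

  // CellState maps task IDs to the node each task is placed on.
  type CellState map[int]string

  // Recover starts from a snapshot and replays the log in order.
  func Recover(snapshot CellState, log []Change) CellState {
      state := CellState{}
      for id, node := range snapshot {
          state[id] = node
      }
      for _, c := range log {
          state[c.TaskID] = c.Node // later entries overwrite earlier placements
      }
      return state
  }

  func main() {
      snap := CellState{1: "n01", 2: "n02"}
      changes := []Change{{2, "n07"}, {3, "n03"}} // mutations since the snapshot
      fmt.Println(Recover(snap, changes))         // map[1:n01 2:n07 3:n03]
  }

The same data doubles as a simulation input: feeding a checkpoint (plus synthetic changes) to a Fauxmaster-style simulator is how debugging and capacity planning can be done without touching the live cell.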

Omega

  • Built to be a better Borg scheduler
    • Architectural improvements
    • What were the problems with the original Borg scheduler?
  • Storing internal state
    • Transaction-oriented data store
    • Paxos for consensus, fault-tolerance
    • So far this is pretty similar to Borg, but...
    • Optimistic concurrency control to handle conflicts
    • Different parts of the cluster control plane access the data store
    • The store is the only centralized component!
  • How is their approach to internal state better than Borg?
    • It enables them to break up the Borgmaster
    • No longer a single point of failure
    • Easier to scale and modify the control plane as needed
  • Monolithic vs Two-Level vs Shared-State schedulers
    • monolithic schedulers (generally) have no concurrency
    • they use a single scheduling algorithm (it usually has to be both simple and fair)
    • two-level schedulers have a component that passes off the resources to other schedulers
    • two-level -> we are “scheduling” our schedulers
    • shared state schedulers operate on readily available copies of resources
    • some important properties in Omega’s design (sketched in code after this list):
      1. lock-free
      2. optimistic concurrency control
      3. transaction-based
  • What does “lock-free” mean?
    • A lock-free algorithm guarantees that, at any given time, at least one core is able to make progress (individual computations may still starve) – in other words, guaranteed system-wide progress
    • A wait-free algorithm has stronger guarantees: no computation will ever be blocked by another computation – in other words, guaranteed per-thread progress
    • As you can imagine, wait-free algorithms are very hard to come up with (but the trade-off we get for higher complexity is even better concurrency guarantees)
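
Below is a minimal Go sketch of the optimistic, transaction-style commit idea: each scheduler plans against its own snapshot of cell state and then tries to commit; if anyone else committed in the meantime (or the resources are gone), the commit is rejected and the scheduler retries. The store, field names, and conflict rule are invented simplifications, and the mutex only makes each commit atomic; the point is that no scheduler holds a lock over shared state while it is deciding.

  // Sketch: shared-state scheduling with optimistic concurrency control.
  package main

  import (
      "errors"
      "fmt"
      "sync"
  )

  // CellState is a versioned view of free CPU units per node.
  type CellState struct {
      Version int
      Free    map[string]int
  }

  // Store is the single centralized component: the authoritative cell state.
  type Store struct {
      mu    sync.Mutex
      state CellState
  }

  // Snapshot gives a scheduler a private copy to plan against.
  func (s *Store) Snapshot() CellState {
      s.mu.Lock()
      defer s.mu.Unlock()
      free := map[string]int{}
      for n, c := range s.state.Free {
          free[n] = c
      }
      return CellState{Version: s.state.Version, Free: free}
  }

  var ErrConflict = errors.New("state changed since snapshot; retry")

  // Commit applies a placement only if nothing changed since the snapshot
  // was taken and the node still has room; otherwise the transaction fails.
  func (s *Store) Commit(snapVersion int, node string, cpu int) error {
      s.mu.Lock()
      defer s.mu.Unlock()
      if snapVersion != s.state.Version || s.state.Free[node] < cpu {
          return ErrConflict
      }
      s.state.Free[node] -= cpu
      s.state.Version++
      return nil
  }

  func main() {
      store := &Store{state: CellState{Free: map[string]int{"n01": 4, "n02": 8}}}
      for {
          snap := store.Snapshot() // plan against a private copy
          if store.Commit(snap.Version, "n02", 2) == nil {
              break // placement landed
          }
          // another scheduler won the race; re-read and try again
      }
      fmt.Println(store.Snapshot().Free) // map[n01:4 n02:6]
  }

Contrast this with a monolithic scheduler, which serializes every decision through one algorithm, or a two-level scheduler, which pessimistically hands out resource offers so that only one scheduler can consider a given resource at a time.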

How Does K8s Fit in Here?

  • K8s is an Omega designed for end-users
    • Google wanted to sell more compute time on their cloud platform
    • So they made the cloud more attractive to program on
  • Pods
    • A higher level abstraction on top of deployment containers
    • Pods are containers that encapsulate one or more inner containers
    • The pod defines the available resources; the inner containers provide the isolation
    • (Borg had a similar concept called the “alloc” but this was optional)
  • Borg relied on external tooling to handle many of its resource and cluster management needs
  • On the other hand, K8s tries to make its API consistent, rather than ad-hoc
    • Exposes a set of API objects, each with its own purpose and accompanying metadata
    • API objects have base object metadata, a spec, and a status
    • Spec varies by object type and describes its desired state
    • Status describes its current state
    • Object metadata then defines a consistent way of labelling objects
  • Replication and auto scaling
    • K8s decouples these concepts
    • Replica controller ensures that a given number of pods of a given type are always running
    • The auto scaler then just adjusts the number of desired pods, and the replica controller sees this change
  • Separation of concerns between K8s components
    • This is the Unix philosophy!
    • Composable components that do one thing and do it well
  • Different flavours of API object for different purposes, e.g.
    • ReplicationController -> runs forever-replicated containers
    • DaemonSet -> a single pod instance on every node
    • Job -> a controller that can run an arbitrarily-scaled batch job from start to finish
  • Reconciliation controllers
    • Continually observe state in a loop (a minimal reconcile loop is sketched after this list)
    • Where discrepancies between current and desired state are detected, take action
    • The actual controller is stateless
    • This makes it fault-tolerant! If a controller fails, just let someone else pick up the slack
  • All store accesses are forced through a centralized API server
    • This helps to provide consistency in access
    • Centralization makes it easy to enforce consistent semantics and provide abstractions
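
To tie the API-object and reconciliation ideas together, here is a minimal Go sketch of a hypothetical ReplicaSet-like object (metadata + spec + status) and a stateless reconcile step that moves observed state toward desired state. The names and fields are illustrative only; the real API types live in k8s.io/api, and real controllers are built with client-go and talk to the API server.

  // Sketch: desired state (spec) vs observed state (status), reconciled in a loop.
  package main

  import "fmt"

  type ObjectMeta struct {
      Name   string
      Labels map[string]string
  }

  type ReplicaSpec struct {
      Replicas int // desired number of pod copies
  }

  type ReplicaStatus struct {
      Running int // pod copies currently observed running
  }

  type ReplicaSet struct {
      Meta   ObjectMeta
      Spec   ReplicaSpec
      Status ReplicaStatus
  }

  // reconcile compares desired and current state and takes corrective action.
  // It keeps no state of its own, so any replica of the controller can run it.
  func reconcile(rs *ReplicaSet) {
      diff := rs.Spec.Replicas - rs.Status.Running
      switch {
      case diff > 0:
          fmt.Printf("%s: starting %d pod(s)\n", rs.Meta.Name, diff)
          rs.Status.Running += diff // stand-in for asking the API server to create pods
      case diff < 0:
          fmt.Printf("%s: stopping %d pod(s)\n", rs.Meta.Name, -diff)
          rs.Status.Running += diff
      }
  }

  func main() {
      rs := &ReplicaSet{
          Meta:   ObjectMeta{Name: "web", Labels: map[string]string{"tier": "frontend"}},
          Spec:   ReplicaSpec{Replicas: 3},
          Status: ReplicaStatus{Running: 1},
      }
      reconcile(rs)        // brings Running from 1 up to 3
      rs.Spec.Replicas = 5 // an autoscaler only ever edits the desired count
      reconcile(rs)        // the controller notices and starts 2 more
  }

Note how the autoscaler in this sketch only edits the desired replica count and never starts pods itself; that is exactly the decoupling between autoscaling and the replica controller described above.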

Discussion Questions

What is so special about using the container as a unit of computation? Did the papers discuss any particular motivation for this design choice? How does this approach differ from earlier papers we have looked at in this course?

How do the containers described here compare with Solaris Zones? How do both approaches compare with vanilla Unix?

How does Borg compare with Omega in terms of its design and architecture? How do both systems compare with Kubernetes? Which of these systems seems the most user-friendly? Are any of them better-equipped to fill some obvious niche?

Why do you think Google still mostly relied on Borg internally (at the time of these papers), despite admitted shortcomings?

What motivated the design of three different container orchestration frameworks? Why not just modify Borg as needed?

Did any tricky questions come up when reading the papers? Discuss these with your group and try to come up with possible answers to these questions.

Did you have any observations about a particular design choice (good or bad)? Try to reach an agreement as a group on a particular design choice. If you can’t reach an agreement, talk about why your opinions differ.

K8s Demo

  • Cluster setup, Openstack setup, getting ready for experience 1
  • Demonstrate some specific parts of experience 1, without giving anything away
  • Any specific requests?