DistOS 2021F 2021-09-09

Notes

Lecture 1
---------

Outline stuff
 - grad students have to do a project
 - undergrads can do exam only
 - reading responses or quiz is for every assigned reading
   - make sure you read it
   - can always turn in a response by the night before
   - quiz (when available) should be turned in an hour before class time
   - rationale: you should prove you've read things, and I need extra
     time to read responses before class
   - meant to be participation grades, should mostly be getting 100%

group discussions
 - will happen most classes
 - when they do, you should submit a group report
 - normally, will take up half of a class
 - group reports are one per group
 - groups will normally be randomized each time
   - but I will allow a blacklist for each of you
 - groups will be normally 3-4 people

implementation experiences
 - this is new!
 - idea is two of these over the term where you'll play with kubernetes
 - grades will be based on tasks completed
   - you'll know what you need to do to get an A
 - if you decide to do an implementation-focused project, you'll probably want to build on what you learned in the experiences


Projects
 - can be implementation-based, literature review, or research proposal
 - should be individual normally.  I will allow pairs, but then I expect 2x the work, and it should be clear who did what


What is an operating system?
 - "controls and multiplexes access to resources"
 - turns the computer you have into the computer that you want to program
   - abstraction, resource management, protection

A distributed OS is an operating system that does the above, but for multiple
computers connected by a network

So, is a distributed OS harder to make than a regular OS?
 - YES
 - but why?

In general, is it easy to parallize a sequential algorithm efficiently?
 - not at all

Using 10 systems to accomplish a task doesn't mean we'll get a 10x speedup over one system

Problem is coordination (synchronization is part of this)
  - how to split up the problem
  - how to communicate about the work as it is done
  - latency, bandwidth impact performance speedups
 
resource management is hard because in a big enough distributed system, things fail all the time
 - so you have to achieve fault tolerance somehow
 - communication, computation, and storage can all fail in arbitrary ways

on single systems, failures happen but most are masked/recovered from at the hardware level
 - can't do this in a distributed system easily

heterogeneity is also an issue, can lead to compatibility issues, but that can be managed
 - difficult but not fundamentally hard

In fact, most of what distributed OSs support are tasks that are embarassingly parallel
 - in other words, they can be decomposed almost arbitrarily and distributed without significant performance penalty
 - if it isn't embarassingly parallel, it probably won't scale

Modern day cloud infrastructures are primitive implementations of a distributed OS
  - definitely do resource management, but abstraction is often poor
    - you have to manage the systems individually

The real secret of distributed OSs is that we can't solve problems generally
 - instead, we come up with good enough solutions to specific problems
 - have to specialize in order to circumvent the limitations of being distributed
   - the distributed nature of the system leaks unless you limit your scope

So, in a sense distributed OSs don't exist the way single system OSs do
  - nothing as general as Linux or Windows

Things like kubernetes is as close as we get
  - but as you'll see, the abstraction level is very poor

We are going to talk about containers a lot!

We'll also discuss virtual machines

Embarassingly parallel - I can get linear speedups by adding resources (cores)
2 cores is 2X as fast, etc
  - very few problems are like this
  - but most cloud workloads are!

Web applications are mostly built to be as embarassingly parallel as possible
 - state coordination happens in the database, and there we have to use other techniques

Hardest problem at web scale that is mostly "solved" (ish) is search
 - and even that is mostly parallelizable
    - crawling the web
    - indexing
 - key example: Google search
    - and they invented many of the technologies we'll discuss


What do single system operating systems look like?
 - Windows, Linux, MacOS, etc

Specifically, what is the unit of computation?
 - processes

One approach to making a distributed OS is just to distribute processes
 - older approaches tried this
 - but it doesn't work

Consider process p on machine A, want to move to machine B
What do you do with p's
 - open files?
 - network connections?
 - other processes it depends on?
   - shared memory?

UNIX/Windows processes were never meant to migrate between hosts, and it shows

elegant solution to this is Plan 9
 - makes UNIX run across a cluster of systems

But a Plan 9 type solution of distributing processes didn't allow for true scalability
 - didn't really support parallel workloads

So we got some very different stuff, as we'll see.

Don't distribute processes, distributed operating systems
 - a container is just a regular OS minus the kernel
 - a hardware virtual machine is an entire os that is abstracted so it can run
   on arbitrary hardware (and share it with another OS)

But then, how do we get all these individual OSs to work together?
 - orchestration (kubernetes)
 - specialized applications (gfs, mapreduce)