DistOS 2021F 2021-12-09
Jump to navigation
Jump to search
Notes
Lecture 22 ---------- Final exam review - themes of the class - key papers & concepts - potential questions Format of final exam - 3 hours - essay questions, 3 or 4 out of more (haven't created final yet) - open book, open note, open internet - just, NO COLLABORATION - and please, no plagiarism - all words should be your own - wasn't an issue on the midterm Themes of the class - distributed OSs can't just copy single-system OSs because the abstractions don't scale - UNIX-like systems work well for individual systems - people know how to program them, so we want to keep them as much as possible - but to scale, we need to approach problems from different angles, make use of different abstractions Why don't single system OSs scale? Key assumptions are violated - process memory is strongly consistent and access to it is roughly uniform - low latency - files are consistent - system all works or is completely broken - partial failures not a concern - so, communication is consistent and reliable This all changes when systems are connected by networks - network is *unreliable* and insecure - lose packets or entire connections - individual hosts fail independently yet distributed application should keep running - has to, because otherwise nothing would work (You know video for all classes are available through brightspace right?) Note that older attempts at distributed OSs dealt with distribution but not with reliability - allowed for whole system to go down when individual ones did Later there was replication to enable reliability But only when we had jobs that needed to be bigger than one host did we really scale things - the web (Scientific computing has always needed lots of computing power, but it was a specialized discipline with very different kinds of production requirements.) So, what are the strategies to make a platform for distributed applications? - keep UNIX, but in the form of containers - ephemeral, often immutable hosts - orchestration to manage containers - distributed data stores (filesystems, key-value, relational databases) built on immutability as much as possible - record append files - immutable tablets - immutable objects/chunks - relax consistency as much as possible - where we need consistency, need special mechanisms - use consensus algorithms (Paxos, etc) as little as possible to avoid bottlenecks - use synchonized time/order of events - make use of domain-specific regularities to optimize - replicate data and metadata for reliability & performance - replication leads to consistency issues, which are solved through a combination of immutability and consensus - if you want performance, assume system is trusted - nodes can fail, but any bad behavior is strictly bounded - if you can't trust systems centrally because they are controlled by mutually non-trusting parties, you have to address the byzantine general's problem - and solutions tend to be expensive! - this is where you get cryptocurrencies and proof of work - these systems are best suited for embarassingly parallel tasks, e.g., ones that fit the mapreduce model or are serving independent customers concurrently - great for major online app/service providers, Facebook, Google, Amazon, etc - however, we can do computations that require more coordination - but we have to optimize every specific type separately (i.e., we can optimize for ML but it won't help with doing large-scale simulations) Single system OSs got us used to having a single infrastructure for everything - in a distributed world, the infrastructure has to mirror application requirements - except, maybe, when we do lots of engineering (think Ceph and Spanner) - and even there, we have to be careful, will need to do app-specific tweaking The best essays will - have a clear argument - logical arguments - many specific examples drawing upon multiple papers from the class - you can bring in other sources, but focus should be on what we covered Remember you're demostrating that you know the class material