DistOS 2023W 2023-01-11

Group Discussion

Today we'll be discussing the challenges in building distributed applications. To this end, we will start with the following design exercise. You'll be split into small groups to discuss, and then we'll meet together to discuss what you all found. Every group should take notes so you can write up a group report afterwards. You should also appoint someone to tell the class a 2 minute (or less) highlights from your discussion.

The Scenarios

You have 10,000 networked computers. Consider the following tasks:

A file server for storing lots of small files for many users
A file server for storing relatively few but enormous files that are accessed my many systems at the same time
A whole-planet weather simulation
A database for airline reservations
A social media app where individuals only communicate with a small number of friends
A social media app where users can subscribe to topics, groups, and individual feeds, potentially allowing for the sharing of information amongst all users at the same time.
(Make up your own tasks.)

At a high level, how would you architect a system to distribute these workloads over the available computers? Be sure to consider:

Would your architecture be robust to individual hosts crashing, being corrupted, or becoming malicious?
How performant is your solution? Will it scale to even more hosts?
Are some hosts more important than other hosts?

Note that you do not need to discuss all of the above; instead, use them as inspiration for your own discussion of distributed applications.

Notes

Lecture 2
---------

group comments
 - peer-to-peer
 - replication (how much?)
 - caching
 - partitioning data
 - partitioned networks (systems drop off)
 - routing
 - load balancing
 - fault tolerance & recovery
 - partial file transfers
 - sharding (geo, etc)
 - concurrent access
 - optimistic access (and conflict recovery)

What scenarios seemed to be the hardest?  Why?
 the weather simulation, because it is not "embarassingly parallel"

The sad thing about distributed systems is...we really can't do much unless the problem is mostly embarassingly parallel
 - that's the only situation that truly scales
 - you really need high bandwidth, *low latency* to do these kinds of problems

What is difficult in parallel problems is managing communication
 - because communication is necessary and expensive

low latency allows for rapid communication - because communication requires
back-and-forth interactions

Amdahl's law
 - performance is limited by part that cannot be parallized
 - if you make one component super fast/parallel, it will be held back by the rest of the system
 - to make a system high performance every part has to be optimized
   - and some components and interactions are much harder to speed up than others

For example, consider CPU design
 - why don't we have n-core systems, for n "big"?
   - we have them for GPUs?

GPUs are optimized for parallel processing
 - mostly embarassingly parallel, i.e., tasks that can be cleanly partitioned

CPUs are optimized for sequential processing first
 - highly branched code

Sequential code can only be parallelized with *hard* work
 - automated parallelization is possible to some degree,
   if we are willing to waste resources
   (see CPU speculative execution)

The main way computers get faster now is through parallelization
 - through increasingly clever ways
 - and being clever doesn't scale

When we go distributed, the situation is much worse
 - because of higher latency and increased possibility of errors