DistOS 2023W 2023-04-03

Notes

Spanner & Tensorflow
--------------------

Last two papers!

April 5th - class wrap-up discussion, exam review
April 10 - project presentations

Spanner - big, distributed SQL database (mostly)
 - at Google
 - compare with Bigtable, Dynamo (NoSQL systems)
   - what is the difference in functionality?
   - why does it matter?
 - HOW?! what is the "neat trick"?
 - has to do with time, but why?
 - to what degree is Spanner a full relational database, like Postgres?

Tensorflow
 - how does it compare with mapreduce?
 - what type of operations does it handle?
 - why are these operations important?
 - how are they done in a scalable fashion?
 - why isn't this more general?


AFTER DISCUSSION

Only presentations on April 10th, no class on April 12th

If you wish to volunteer to present on April 5th, please PM me
 - time left for speakers on the 10th will depend on how many volunteer on the 5th
 - may have "lightning talks" of 3-4 min

Thoughts on Spanner?

So why Spanner?
 - wanted to support more complex queries (e.g., ones involving multiple tables)
    - programmers are used to using SQL
 - and transactions
    - great for making updates without corrupting data (i.e., no partial, incomplete updates)
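
To make those two points concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (this is NOT Spanner, just a single-machine SQL database). The accounts/transfers schema and the transfer() helper are made up for illustration. It shows a transaction that either applies all of its writes or none of them, plus a query that joins two tables.

```python
import sqlite3

# Hypothetical two-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT,
                           balance INTEGER CHECK (balance >= 0));
    CREATE TABLE transfers (src INTEGER, dst INTEGER, amount INTEGER);
    INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 50);
""")

def transfer(conn, src, dst, amount):
    """Move money and log it; either all three writes happen or none do."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("INSERT INTO transfers VALUES (?, ?, ?)",
                         (src, dst, amount))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint fired on the last write; the first two are undone.
        return False

ok = transfer(conn, 1, 2, 30)        # fits within alice's balance
failed = transfer(conn, 1, 2, 500)   # would overdraw alice, so it rolls back

# A query involving multiple tables: who paid whom, by name.
rows = conn.execute("""
    SELECT a.owner, b.owner, t.amount
    FROM transfers t
    JOIN accounts a ON a.id = t.src
    JOIN accounts b ON b.id = t.dst
""").fetchall()
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
```

The failed transfer leaves no trace: bob is not credited and no row is logged, even though those writes ran before the error. Doing this across many machines is the hard part that Spanner takes on.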

SQL is the obvious choice for making a database
 - but NoSQL systems like Bigtable and Dynamo compromised on it (and on general transactions) to support scalability
 - how does Spanner do SQL & scalability?

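TrueTime is Spanner's way of getting really accurate time
 - generally from GPS (with atomic clocks as a backup)
 - it reports an interval that should contain the true time, not a single timestamp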

GPS turns time information into location
 - only works with super accurate clocks
 - GPS satellites are just atomic clocks that broadcast their time
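
A quick back-of-the-envelope calculation shows why the clocks have to be so accurate: a receiver estimates distance as signal travel time multiplied by the speed of light, so any timing error becomes a distance error. The helper name range_error_m below is made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre

def range_error_m(clock_error_s):
    """Distance error caused by mis-timing a radio signal by clock_error_s seconds."""
    return SPEED_OF_LIGHT_M_PER_S * clock_error_s

for label, err in [("1 ms", 1e-3), ("1 us", 1e-6), ("1 ns", 1e-9)]:
    print(f"clock error of {label}: range error of about {range_error_m(err):,.1f} m")
```

A millisecond of error is roughly 300 km, a microsecond is roughly 300 m, and a nanosecond is roughly 30 cm.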

So any computer with a GPS receiver has access to a very accurate clock

So put GPS receivers in the datacenters, the servers get accurate time, and now we can get an absolute ordering of events

Normally you can't use local clocks to determine relative order of events because of clock skew.
 - but with really accurate clocks, you can!
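
A minimal sketch of the idea, loosely modelled on the interval-style clock the Spanner paper describes. The names (Interval, tt_now, definitely_before, commit) and the 5 ms uncertainty bound are assumptions made for illustration, not Google's API or real numbers. The point is that "accurate" means the uncertainty is small and known, so you can order two events whenever their intervals don't overlap, and you can force that to be true by waiting out the uncertainty before announcing a commit.

```python
import time
from dataclasses import dataclass

EPSILON_S = 0.005  # assumed uncertainty bound; illustrative value only

@dataclass(frozen=True)
class Interval:
    earliest: float
    latest: float

def tt_now(clock=time.time, epsilon=EPSILON_S):
    """Return an interval assumed to contain the true time."""
    t = clock()
    return Interval(t - epsilon, t + epsilon)

def definitely_before(a, b):
    """True only if every possible time in a precedes every possible time in b."""
    return a.latest < b.earliest

def commit(clock=time.time, sleep=time.sleep, epsilon=EPSILON_S):
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    ts = tt_now(clock, epsilon).latest
    while tt_now(clock, epsilon).earliest <= ts:
        sleep(epsilon / 10)  # the "commit wait": roughly 2 * epsilon in total
    return ts

first = commit()
second = commit()  # starts only after the first commit's wait has finished
```

Because the second commit starts after the first one's wait is over, its timestamp is larger, and that holds even if the two commits run on different machines, provided every machine's clock really is within epsilon of true time. The smaller the uncertainty, the shorter the wait, which is why the GPS hardware matters.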

Traditionally databases were the hardest part of a web app to scale
 - but now we really can scale them as far as we want

Tensorflow

parallel computing is hard
 - especially when communication is slow/expensive

Distributed OS is largely trying to hide the complexity of parallel computing
from apps
 - provide the right abstractions

POSIX files and processes aren't the right abstractions for this

But append-only files, containers, and specialized computation abstractions are
 - mapreduce, BOINC were our first examples, both required embarrassingly parallel problems
 - tensorflow is an effort to parallelize machine learning apps in a general way
 - abstraction: dataflow

So why tensorflow?
 - dataflow with tensors
 - tensors are just multidimensional arrays
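
To show what "dataflow with tensors" means, here is a toy sketch in plain Python. It does not use the real TensorFlow library; the Node class and the helper functions are made up for illustration, and tensors are just nested lists. The computation is first described as a graph of operations and only runs later, when the output node is asked for its value.

```python
class Node:
    """One operation in a dataflow graph; its edges are its input nodes."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self, cache=None):
        """Evaluate the inputs first (each node only once), then apply this op."""
        cache = {} if cache is None else cache
        if id(self) not in cache:
            args = [n.run(cache) for n in self.inputs]
            cache[id(self)] = self.op(*args)
        return cache[id(self)]

def constant(value):
    """A node with no inputs that just produces a fixed tensor."""
    return Node(lambda: value)

def matmul(a, b):
    """Matrix product of two rank-2 tensors stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two rank-2 tensors of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def relu(a):
    """Elementwise max(0, x)."""
    return [[max(0, x) for x in row] for row in a]

# Build the graph for relu(x @ w + b); nothing is computed yet.
x = constant([[1, -2]])           # shape (1, 2)
w = constant([[3, 0], [1, 4]])    # shape (2, 2)
b = constant([[2, 1]])            # shape (1, 2)
y = Node(relu, Node(add, Node(matmul, x, w), b))

result = y.run()  # [[3, 0]]
```

Since the graph spells out exactly which operations depend on which, a runtime is free to place independent nodes on different devices or machines and ship tensors along the edges. That is the scalability angle; this toy version just runs everything locally.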