Notes
Spanner & TensorFlow
--------------------
Last two papers!
April 5th - class wrap-up discussion, exam review
April 10 - project presentations
Spanner - big, distributed SQL database (mostly)
- at Google
- compare with Bigtable, Dynamo (NoSQL systems)
- what is the difference in functionality?
- why does it matter?
- HOW?! what is the "neat trick"?
- has to do with time, but why?
- to what degree is Spanner a full relational database, like Postgres?
TensorFlow
- how does it compare with MapReduce?
- what type of operations does it handle?
- why are these operations important?
- how are they done in a scalable fashion?
- why isn't this more general?
AFTER DISCUSSION
Only presentations on April 10th, no class on April 12th
If you wish to volunteer to present on April 5th, please PM me
- time left for speakers on the 10th will depend on how many volunteer on the 5th
- may have "lightning talks" of 3-4 min
Thoughts on Spanner?
So why Spanner?
- wanted to support more complex queries (i.e., ones involving multiple tables)
- programmers are used to using SQL
- and transactions
- great for updates without corrupting data (i.e., no partial, incomplete updates)
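A minimal sketch of the "no partial updates" point, using Python's built-in sqlite3 on a single machine (not Spanner, and not distributed). The accounts table, the CHECK constraint, and the transfer helper are all made up for illustration. The credit runs first and the debit second, so a failed debit would leave a half-applied update if the two statements were not wrapped in one transaction.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 100), ("bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        # "with conn" commits if the block succeeds and rolls back
        # the whole transaction if any statement raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(conn, "bob", "alice", 80) fails on the debit, and the earlier credit to alice is rolled back with it, so the balances are unchanged.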
SQL is the obvious choice for making a database
- NoSQL systems compromised on it to support scalability
- how does Spanner do SQL & scalability?
TrueTime is a way to get really accurate time, with a known bound on the error
- generally from GPS (Spanner also uses atomic clocks in the datacenter)
GPS turns time information into location
- only works with super accurate clocks
- GPS satellites are just atomic clocks that broadcast their time
So any computer with a GPS receiver has access to a very accurate clock
So put GPS receivers in servers: they have accurate time, and now we can get an absolute ordering of events
Normally you can't use local clocks to determine the relative order of events on different machines because of clock skew.
- but with really accurate clocks, you can!
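A rough sketch of why clock accuracy matters for ordering. Each timestamp is treated as an interval (reading plus or minus an assumed maximum error), and two events can be ordered only when their intervals do not overlap. The Stamp class, the definitely_before helper, and the error bounds (100 ms vs. 1 ms) are illustrative assumptions, not Spanner's actual API or measured numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamp:
    reading: float  # local clock reading, in seconds
    eps: float      # assumed maximum clock error, in seconds

    @property
    def earliest(self):
        return self.reading - self.eps

    @property
    def latest(self):
        return self.reading + self.eps


def definitely_before(a, b):
    """True only if a happened before b no matter how the clocks are off."""
    return a.latest < b.earliest


# Two events 5 ms apart, stamped on different machines.
# With a loose error bound the intervals overlap, so order is unknown.
skewed_a = Stamp(10.000, 0.100)
skewed_b = Stamp(10.005, 0.100)

# With a tight error bound the same readings can be ordered.
tight_a = Stamp(10.000, 0.001)
tight_b = Stamp(10.005, 0.001)
```

With the loose bound, definitely_before(skewed_a, skewed_b) is False; with the tight bound, definitely_before(tight_a, tight_b) is True. The smaller the error bound, the closer together two events can be and still be ordered by timestamp alone.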
Traditionally databases were the hardest part of a web app to scale
- but now we really can scale them as far as we want
TensorFlow
parallel computing is hard
- especially when communication is slow/expensive
Distributed OS is largely trying to hide the complexity of parallel computing from apps
- provide the right abstractions
POSIX files and processes aren't the right abstractions
But append-only files, containers, and specialized computation abstractions are
- MapReduce, BOINC were our first examples; both required embarrassingly parallel problems
- TensorFlow is an effort to parallelize machine learning apps in a general way
- abstraction: dataflow
So why TensorFlow?
- dataflow with tensors
- tensors are just multidimensional arrays
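A toy sketch of "dataflow with tensors" in plain Python, to make the abstraction concrete. This is not the TensorFlow API: Node, constant, matmul, add, and run are hypothetical names, and tensors are just nested lists (2-D arrays). The point is that the program first builds a graph of operations, then a separate step walks the graph and computes values. Because the graph is explicit, a real system is free to place different nodes on different devices or machines.

```python
class Node:
    """One operation in the dataflow graph."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value  # only used by "const" nodes


def constant(value):
    return Node("const", value=value)


def matmul(a, b):
    return Node("matmul", (a, b))


def add(a, b):
    return Node("add", (a, b))


def _matmul(x, y):
    # Plain 2-D matrix multiply on nested lists.
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def _add(x, y):
    # Elementwise add of two same-shaped 2-D tensors.
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


KERNELS = {"matmul": _matmul, "add": _add}


def run(node, cache=None):
    """Evaluate a node by first evaluating the nodes it depends on."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        args = [run(n, cache) for n in node.inputs]
        result = KERNELS[node.op](*args)
    cache[id(node)] = result
    return result


# Build the graph for y = x @ w + b; nothing is computed yet.
x = constant([[1, 2]])          # shape 1x2
w = constant([[3, 4], [5, 6]])  # shape 2x2
b = constant([[10, 20]])        # shape 1x2
y = add(matmul(x, w), b)
```

Nothing is computed until run(y) is called; here it evaluates the matmul node, then the add node, and returns [[23, 36]].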