DistOS 2018F 2018-11-05

Readings

Chang et al., "BigTable: A Distributed Storage System for Structured Data" (OSDI 2006)
Corbett et al., "Spanner: Google’s Globally-Distributed Database" (OSDI 2012)

Notes

Lecture Notes:

Lecture:

Why did they build big table: Other apps other than a web crawler wanted to use the GFS….BigTable, wanted it to be fast, lots of data … multidimensional sorted map (not a database)

The Ad people said that BigTable was dumb b/c they wanted to do queries to generate business reports….so they now have business reports. Every technology bears the fingerprints of the initial use cases….unix has it all over the place...weirdness everywhere.

Here: what is Google determines the technology stack. How much code is at Google, source code repository is in the billions of lines of code….what they have done? Not only complex apps, the entire infrastructure for distributed OS, services and everything built internally mostly from scratch. Pioneers, they now have the legacy code problems. …. Opensource might have something better but they are locked in. At the forefront of building stuff at web internet scale….built it and know how to do it. Now less competitive advantage b/c lots of orgs know how to build things similar, yet better.

What’s big table built on-top of? GFS and Chubby. Why not build from scratch? Chubby is specialized, what it does, it does well. One thing to note: when look at papers, the Google techs fit together….building on top of other technologies. Advantages in the amount of code required to write.

Data structure, the SSTable, the crawler stores stuff in SSTables and then this was built on top to query the SSTable rather specialized tools. How to make database-like. Versioning the SSTables and then adding index semantics and localities (a mapping). Request stuff, how to index and find stuff.

Spanner was built on top of Colossus instead of GFS….key problem with GFS, batch-oriented...random access is not well optimized. Why Colossus? Don’t know. The systems being discussed are large systems, why this course is not a coding course. Big systems. What that means, papers published about big systems, things were glossed over. Don’t talk about everything, have to leave a lot out. Paper is saying how to solve the problem. How to implement SQL transactions on a global scale. How to implement an entire system, they don’t go through that.

Memtable (5.3): what is it? A log-structured file system. Writes, don’t overwrite things that have written directly. 5 versions of a file, which versions are the current one, the rest are old. Whatever is in memory, will dump again etc. but with a log structure, what you do is maintain new new new new new and then clean up the old stuff. Have a compaction thing. Take current stuff, put it at the end and the cleanup...this is an important pattern to see with of systems.

How does a normal file system work? Have blocks Hard drives (mechanical): sequential read and writes are fast. While random, moving head between tracks is a slow operation. Density increases, faster at the same rpm speed.

How to speed up? Keep in ram and then periodically write to disk. Good for performance until you have a problem like a system loses power. Big risk, might lose some data but what if you corrupt the data structure. Corrupted, lose all the data on disk. FSCK on windows, chkdisk, all they do, go through and see if it is a clean file system. If in a dirty state, walking the file system and ensure consistency...checking metadata blocks. The times on modern drives is enormous if have to check the whole drive. RAID arrays, lots of time, can’t do this, this is bad.

So idea of the journaled FS. Set aside a portion of the disk where you do all your writes first….written sequentially….to do it fast. Now committed, if lose power, all you have to do is replay the journal. Great system when workload is dominated by reads. But if you have a lot of writes, this is bad b/c have doubled the writes. What is the solution? Eliminate everything except the journal. Just write records one after the other. Block number might be on disk 50 times, b/c have written multiple versions but then, you go back and later reclaim it.

So the longer you go without cleanup, without committing to disk, the faster the performance. New report supersedes the old report. 20 copied of A and one version of B...need to do a cleanup 1 copy for A, 1 for B in a pile and then throw the rest out. Big table does almost the same thing. Keeps writing new data and the old data might still be valid. Classic strategy when you have lots of writes and you want to avoid random access, using sequential writes instead.

So Spanner, they wanted a database with transactions. What is the hard part of a transaction? Have a consistent state. Have to have the same view globally, globally literally means the globe. Want to have a consistent view of the data. One place has seats, the other place has payments, if not consistent, selling payments for seats you don’t have. How does time help you solve that? Creates a concept of a before and after….what is the global order of events. That is what you need for global consistencies. In a distributed system, ordering is not consistent by default. So if you have to maintain a global ordering, you are screwed. A will happen before B and B will happen before A. If it is inconsistent, will have different versions of the data, which one came first etc. This is bad. By having synced clocks, and the skew between them, can assign a timestamp and compare what happened first. If there is any possibility that the ordering is ambiguous, they know it. The window of uncertainty until things settle down and then do the operation. Getting a global ordering of events by using accurate clocks. From a physics standpoint is bizarre. Relativity. Simultaneous events are undefined, depends on the frame of reference. Ordering. GPS, atomic clocks in space sending radio signals essentially.

More notes

Corporate view of BigTable, why did they build it? WebCrawling at first, then other applications used it. Need a system to hold a lot of data.
BigTable not a DB, its a sparse, persistent, distributed map.
Ad people didn't want BigTable, they wanted queries to generate reports, that's what SQL is used for often.
Every technology always bears the fingerprints of its initial uses.
Google has millions of lines of code, manage extremely complicated systems. Google has a problem with complexity. It was building things at the forefront of the web.
Generally a single organization figures out how to build something, it's messy, then others implement it better.
Google technologies fit together.
GFS is fundamentally batch oriented and streams data, not optimized for random reads and writes, perhaps that's why it was succeeded by Colossus.
These systems are MASSIVE!
Things get complicated when you write it, MODIFY IT, then have to query it again.
Memtable: looks kind of like a log structured file system. You never overwrite what you wrote previously, you always supersede it, newest version in RAM, older on disk. Later clean up the old ones.
Normal file system has blocks, some of these blocks are inodes, others will be data, inode will have pointers to data. Another block will be a directory which is basically a file that has a list of names to inodes and the inode points to actual data. Whenever you have to update a file you have to write everywhere, bouncing all over the place and this is slow on a spinning hard drive. Hard drive optimized for sequential writes and reads. Speed this up by caching in memory then periodically write. All good for performance until you lose power! Most recent stuff in memory is lost, not a big deal, but what if you corrupt data structure?? Use fsck to fix this but very slow! Fsck times on modern drives is enormous, in raid arrays it is hours or days - you just cant do this! People came up with Journaled file system: Set aside some portion of the disk which is the journal - this is where you do all your writes first, sequentially, thus you have to write twice to disk, but since its sequential its fast. If you lose power all you have to do is replay journal. Great system if workload dominated by reads, but if you have a higher proportion of writes then this is bad, as you double every write. SOLUTION: eliminate the block part and make the entire file system a journal, this makes it wasteful for space but you can go back and clean it up. The longer you go without cleanup the longer this cleanup process takes. Thus you periodically do a compaction phase. Basically you throw away older versions and keep most current. BigTable basically does this.
Spanner is a DATABASE, has TRANSACTIONS, transactions difficult because Spanner is global and you want a consistent view of your data. Need much better synchronization.
TrueTime creates the concept of before/after, need to know global order of events. In a distributed system ordering is not ordered by default since you have to copy and all that stuff.
If you want global consistency then you're basically screwed because it will always be slightly off. So by having synchronized clocks and measuring the skew between them you can go back later and definitely conclude the order.
Get global consistency through accurate times … kind of goes against physics and quantum mechanics since synchronous events undefined depending on frame of reference, or something. All systems are on the same planet so it works out!