DistOS 2018F 2018-11-05: Difference between revisions
No edit summary |
|||
Line 9: | Line 9: | ||
Lecture: | Lecture: | ||
Why did they build big table: | Why did they build big table: Other apps other than a web crawler wanted to use the GFS….BigTable, wanted it to be fast, lots of data … multidimensional sorted map (not a database) | ||
The Ad people said that BigTable was dumb b/c they wanted to do queries to generate business reports….so they now have business reports. Every technology bears the fingerprints of the initial use cases….unix has it all over the place...weirdness everywhere. | The Ad people said that BigTable was dumb b/c they wanted to do queries to generate business reports….so they now have business reports. Every technology bears the fingerprints of the initial use cases….unix has it all over the place...weirdness everywhere. |
Revision as of 19:06, 6 November 2018
Readings
- Chang et al., "BigTable: A Distributed Storage System for Structured Data" (OSDI 2006)
- Corbett et al., "Spanner: Google’s Globally-Distributed Database" (OSDI 2012)
Notes
Lecture Notes:
Lecture:
Why did they build big table: Other apps other than a web crawler wanted to use the GFS….BigTable, wanted it to be fast, lots of data … multidimensional sorted map (not a database)
The Ad people said that BigTable was dumb b/c they wanted to do queries to generate business reports….so they now have business reports. Every technology bears the fingerprints of the initial use cases….unix has it all over the place...weirdness everywhere.
Here: what is Google determines the technology stack. How much code is at Google, source code repository is in the billions of lines of code….what they have done? Not only complex apps, the entire infrastructure for distributed OS, services and everything built internally mostly from scratch. Pioneers, they now have the legacy code problems. …. Opensource might have something better but they are locked in. At the forefront of building stuff at web internet scale….built it and know how to do it. Now less competitive advantage b/c lots of orgs know how to build things similar, yet better.
What’s big table built on-top of? GFS and Chubby. Why not build from scratch? Chubby is specialized, what it does, it does well. One thing to note: when look at papers, the Google techs fit together….building on top of other technologies. Advantages in the amount of code required to write.
Data structure, the SSTable, the crawler stores stuff in SSTables and then this was built on top to query the SSTable rather specialized tools. How to make database-like. Versioning the SSTables and then adding index semantics and localities (a mapping). Request stuff, how to index and find stuff.
Spanner was built on top of Colossus instead of GFS….key problem with GFS, batch-oriented...random access is not well optimized. Why Colossus? Don’t know. The systems being discussed are large systems, why this course is not a coding course. Big systems. What that means, papers published about big systems, things were glossed over. Don’t talk about everything, have to leave a lot out. Paper is saying how to solve the problem. How to implement SQL transactions on a global scale. How to implement an entire system, they don’t go through that.
Memtable (5.3): what is it? A log-structured file system. Writes, don’t overwrite things that have written directly. 5 versions of a file, which versions are the current one, the rest are old. Whatever is in memory, will dump again etc. but with a log structure, what you do is maintain new new new new new and then clean up the old stuff. Have a compaction thing. Take current stuff, put it at the end and the cleanup...this is an important pattern to see with of systems.
How does a normal file system work? Have blocks
Hard drives (mechanical): sequential read and writes are fast. While random, moving head between tracks is a slow operation. Density increases, faster at the same rpm speed.
How to speed up? Keep in ram and then periodically write to disk. Good for performance until you have a problem like a system loses power. Big risk, might lose some data but what if you corrupt the data structure. Corrupted, lose all the data on disk. FSCK on windows, chkdisk, all they do, go through and see if it is a clean file system. If in a dirty state, walking the file system and ensure consistency...checking metadata blocks. The times on modern drives is enormous if have to check the whole drive. RAID arrays, lots of time, can’t do this, this is bad.
So idea of the journaled FS. Set aside a portion of the disk where you do all your writes first….written sequentially….to do it fast. Now committed, if lose power, all you have to do is replay the journal. Great system when workload is dominated by reads. But if you have a lot of writes, this is bad b/c have doubled the writes. What is the solution? Eliminate everything except the journal. Just write records one after the other. Block number might be on disk 50 times, b/c have written multiple versions but then, you go back and later reclaim it.
So the longer you go without cleanup, without committing to disk, the faster the performance. New report supersedes the old report. 20 copied of A and one version of B...need to do a cleanup 1 copy for A, 1 for B in a pile and then throw the rest out. Big table does almost the same thing. Keeps writing new data and the old data might still be valid. Classic strategy when you have lots of writes and you want to avoid random access, using sequential writes instead.
So Spanner, they wanted a database with transactions. What is the hard part of a transaction? Have a consistent state. Have to have the same view globally, globally literally means the globe. Want to have a consistent view of the data. One place has seats, the other place has payments, if not consistent, selling payments for seats you don’t have. How does time help you solve that? Creates a concept of a before and after….what is the global order of events. That is what you need for global consistencies. In a distributed system, ordering is not consistent by default. So if you have to maintain a global ordering, you are screwed. A will happen before B and B will happen before A. If it is inconsistent, will have different versions of the data, which one came first etc. This is bad. By having synced clocks, and the skew between them, can assign a timestamp and compare what happened first. If there is any possibility that the ordering is ambiguous, they know it. The window of uncertainty until things settle down and then do the operation. Getting a global ordering of events by using accurate clocks. From a physics standpoint is bizarre. Relativity. Simultaneous events are undefined, depends on the frame of reference. Ordering. GPS, atomic clocks in space sending radio signals essentially.