<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Markkaganovsky</id>
	<title>Soma-notes - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://homeostasis.scs.carleton.ca/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Markkaganovsky"/>
	<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php/Special:Contributions/Markkaganovsky"/>
	<updated>2026-05-13T22:19:21Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-10-29&amp;diff=22009</id>
		<title>DistOS 2018F 2018-10-29</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-10-29&amp;diff=22009"/>
		<updated>2018-11-18T19:00:51Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [http://research.google.com/archive/gfs-sosp2003.pdf Sanjay Ghemawat et al., &amp;quot;The Google File System&amp;quot; (SOSP 2003)]&lt;br /&gt;
* [https://www.usenix.org/legacy/events/osdi06/tech/burrows.html Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems (OSDI 2006)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
Lecture Notes:&lt;br /&gt;
&lt;br /&gt;
Peer-to-peer file sharing&lt;br /&gt;
Napster -&amp;gt; classic silicon valley, business model that makes no sense&lt;br /&gt;
Napster’s pitch: make all music available without actually hosting it themselves; use other people’s machines. Napster maintained a central directory, while the files themselves were stored on individual users’ computers.&lt;br /&gt;
After Napster was shut down, people still wanted to exchange music files, but without a centralized database of all the songs. DHTs are the technology for that. A hash table takes a string and gives you something back; here the question is: where can I download this file? If you implement a distributed hash table, no one system controls it, so to shut it down you would have to shut down a large number of nodes.&lt;br /&gt;
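&lt;br /&gt;
A rough sketch of the DHT idea (an illustration with made-up node names, not Tapestry’s actual routing): hash each key to pick which peer stores it, so any node can answer a lookup without a central directory.&lt;br /&gt;
&lt;pre&gt;
# Toy DHT: keys are spread across peers by hashing, and a lookup needs only
# the key, not a central directory like Napster's.
import hashlib

NODES = ["node-a", "node-b", "node-c"]        # hypothetical peers
tables = {n: {} for n in NODES}               # each peer's local table

def owner(key):
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]              # which peer is responsible

def put(key, value):
    tables[owner(key)][key] = value

def get(key):
    return tables[owner(key)].get(key)

put("some-song.mp3", "192.0.2.7")             # made-up location of the file
print(get("some-song.mp3"))
&lt;/pre&gt;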
&lt;br /&gt;
ISPs throttled the traffic, and the record companies poisoned the networks: you download something and it turns out not to be what it claimed.&lt;br /&gt;
Idea behind Tapestry: DHT as a service. What is an overlay network? A network that sits on top of another network. We have the internet; isn’t the internet good enough? The point is that you want a different topology than the one you have. The underlying network is organized around geographic and organizational boundaries.&lt;br /&gt;
An overlay redoes the topology: you send a message to your neighbours, but the neighbours are defined by the overlay network. &lt;br /&gt;
e.g. Tor defines its own topology.&lt;br /&gt;
Facebook and other social networks: you make a post and it is routed to your friends/neighbours, who receive it. That is the topology of the social graph: who connects to whom. Tapestry is an overlay network, but does it ignore geography? No, it makes use of it. Peers include nodes that are close by; the network is the set of systems running Tapestry, but it also distinguishes nearby nodes from more distant ones. It is network-topology aware.&lt;br /&gt;
Why? At large scale there will be a lot of node additions and deletions. Peer-to-peer started as file sharing, but that was not the authors’ goal; they wanted to build distributed applications of some kind. &lt;br /&gt;
Pond wanted something that provided a messaging layer between its own nodes; that is Tapestry. A static table of hosts will not work, so Tapestry provides the layer that lets nodes find each other and send messages to each other efficiently. Let’s build an infrastructure for sending messages. Do we build apps on top of things like this today?&lt;br /&gt;
&lt;br /&gt;
Keep an eye out for this: DHTs will keep appearing, but there is a fundamental issue with DHTs, the LimeWire problem. They do very badly with untrusted nodes: a node can mess everyone else up by giving bad info to the network. You can stop one or two bad nodes, but attackers have significant resources, like a botnet, to attack your system. &lt;br /&gt;
&lt;br /&gt;
Single Tapestry Node figure 6....the OS...nothing fancy, like the regular internet except they are maintaining state using a distributed hash table...&lt;br /&gt;
Botnets...if they hard-code the IP address, how the botnet gets taken down...one way IRC, google search...use social media like an Instagram account....comments on celebrity Instagram feeds....spies used to send messages with ads in a newspaper or number stations (ham radio).&lt;br /&gt;
Tapestry, trust issues b/c with trusted infrastructure there are better ways of doing it. &lt;br /&gt;
&lt;br /&gt;
Ceph:&lt;br /&gt;
Ceph is crazy, very complicated. Ceph is out there and people are deploying it, but really? What is CRUSH? Suppose you have data and need to know where to go to get the parts of a file. If the metadata server listed every part, you would have to send a lot of data and update metadata every time you changed the file, so they decided not to do that. These are not blocks, they are objects. What do they mean by that? Basically a variable-length chunk of storage; a file is some number of objects in some order. When you open a file, which objects hold its data? Does the metadata server hand you the list of objects? No; an algorithm generates the names of the objects. &lt;br /&gt;
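&lt;br /&gt;
A rough sketch of that idea (a simplification, not Ceph’s actual CRUSH): object names are derived deterministically from the file’s identity and the object index, so a client can compute them instead of fetching a list from the metadata server. The object size and OSD names below are assumptions for illustration.&lt;br /&gt;
&lt;pre&gt;
# Simplified illustration: derive object names from (inode, object index),
# then map each name to an OSD with a hash-based placement function.
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024          # pretend each object holds 4 MB
OSDS = ["osd0", "osd1", "osd2", "osd3"]

def object_name(inode, index):
    return "%d.%08d" % (inode, index)

def osd_for(name):
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return OSDS[h % len(OSDS)]         # stand-in for CRUSH-style placement

def objects_for_read(inode, offset, length):
    first = offset // OBJECT_SIZE
    last = (offset + length - 1) // OBJECT_SIZE
    return [(object_name(inode, i), osd_for(object_name(inode, i)))
            for i in range(first, last + 1)]

print(objects_for_read(inode=42, offset=3000000, length=6000000))
&lt;/pre&gt;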
&lt;br /&gt;
Metadata: keep it in memory for every file. There are hot spots for metadata access; a few metadata servers would be maxed out while the rest of the system sat idle, so the directory tree is dynamically re-partitioned to respond to hot spots. &lt;br /&gt;
&lt;br /&gt;
You can talk to OSDs in parallel, so when asking for a file the data is distributed among lots of nodes and you get high performance: many, many computers talking to many, many computers. &lt;br /&gt;
&lt;br /&gt;
Ceph being POSIX compatible is impressive. POSIX compatibility is painful on writes: you need to coordinate (centralize writes, but that is slow). You can tell Ceph to be lazy about it. Take-home lesson from Ceph: all nodes trusted, POSIX in a distributed OS; you can do it, but the admin overhead is huge. &lt;br /&gt;
&lt;br /&gt;
Tapestry...take home lesson...centralize node (trust)&lt;br /&gt;
&lt;br /&gt;
Compare GFS with Ceph and Chubby (politically correct FAT storage :P) with Tapestry&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Ceph&amp;lt;/strong&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Very fast performance&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Sharp drop off in performance when more than 24 OSDs due to switch saturation.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Ceph sending too many messages, leading to lots of packet loss on a switch.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Ceph is very complicated&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;CRUSH: Generating algorithm to find storage&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Realized that there are hotspots for metadata access&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Typically you statically partition the file tree, ceph dynamically repartitions the metadata tree.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Can ask for objects in parallel, download a file in parallel&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Basically posix compatible&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Easy to be compatible on reads, difficult on writes.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Take home lesson: Posix compatibility in a really big distributed file system, you CAN do it, but admin overhead is ridiculous.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Compare Ceph to the Google File System&amp;lt;/strong&amp;gt;, GFS is also very scalable but not as complicated.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;amp;nbsp;&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Tapestry&amp;lt;/strong&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Limited performance&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;First p2p is napster&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Classic silicon valley, business model that makes no sense&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Use other people&#039;s machines to make music available&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Download music from another peer&#039;s computer, Napster maintained the central directory.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;File did not go through napster, they just pointed to it.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Sued out of existence&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;SOLUTION: Dont have a centralized location service.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;A DHT is you give it a string and it tells you where you can download it.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Record companies poisoned torrents?? Not true imo&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Tapestry is DHT as a service&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Point of overlay network is you want a different topology over what you have&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Don&#039;t want a geographic based network like the basic internet&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Overlay network completely redoes topology, neighbors are defined by the overlay network.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Facebook, or all social media, is an overlay network, your posts are routed to all friends.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Tapestry does not ignore underlying physical topology, still takes locality into consideration.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Why?&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Tapestry, in a sense, is exactly what this course is about&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Let&#039;s build this infrastructure for distributed applications.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;We do not&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;KEEP AN EYE OUT FOR: One fundamental issue with DHTs&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Limewire problem: Do very bad with untrusted nodes.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Have to bootstrap it and find everyone.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;How do botnets work? They use a legitimate service, like Twitter, Facebook, celebrity twitter feed comments.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Spies used to send messages through ads&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Probably a trust issue as to why it isn&#039;t used&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Take home lesson: Here&#039;s this thing for distributed applications, but trust issues.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Contrast Tapestry with Chubby&amp;lt;/strong&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;amp;nbsp;&amp;lt;/p&amp;gt;&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-10-31&amp;diff=22008</id>
		<title>DistOS 2018F 2018-10-31</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-10-31&amp;diff=22008"/>
		<updated>2018-11-18T18:59:40Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett Corbett et al., &amp;quot;Spanner: Google’s Globally-Distributed Database&amp;quot; (OSDI 2012)]&lt;br /&gt;
* [http://research.google.com/archive/bigtable-osdi06.pdf Chang et al., &amp;quot;BigTable: A Distributed Storage System for Structured Data&amp;quot; (OSDI 2006)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
GFS: Why do they want to store stuff?&lt;br /&gt;
&lt;br /&gt;
What is it being used for?&lt;br /&gt;
&lt;br /&gt;
The search engine: a copy of the web, for the crawler and the indices built from the crawl. That motivated the design. Bring in lots of data and do something with it. The crawler goes and downloads everything it can find. What kind of file system do you want? Lots of small files? No, that is not efficient. GFS had an atomic record append. The crawler finds things and puts together a bunch of pages into one data structure; a record is an application-level object describing what the crawler found. Atomic append: got it, saved, done. That matters because the file is not a byte stream, it is application objects appended one after another. &lt;br /&gt;
If you are crawling along, try to do a write and there is a problem with the write, you do it again and try a few times. How many times did it get written? It got written at least once. With record append there is no guarantee a record is written exactly once, so the application level has to detect duplicates; that makes sense in the context of a very large web crawler. Enormous files with lots of the internet in them, in one file. Large files inform the design: the chunk size is 64MB (designed for a large web crawler that wants to save everything). Lots of files and metadata? No; importantly, they did a lot of work to minimize metadata by design choice. One master server stores all the metadata, and you want it to be fast and not the bottleneck, so it is loaded up with RAM and runs from memory: a metadata lookup at the master, then you go to the chunk servers. &lt;br /&gt;
Ceph has an entire cluster to dynamically allocate metadata; GFS has one master server with replicas and that’s it. Seems dumb, why? It is simpler, ridiculously simple. GFS is not a general-purpose file system, not POSIX compatible, just developed for a web crawler: 64MB chunks and one metadata server, optimized for atomic append operations on records that may be duplicated. Why don’t they use GFS now? It was optimized for batch processing: build an index and make it available to the search engine. Now everything is oriented towards fresh data. But GFS is an interesting design from the perspective of specialized storage, designed for a specific application: lots of trade-offs, high performance but limited functionality. If you are doing a read, what is the process? Ask the master first which chunks hold the byte range you want; the master returns the chunk information, and then the client gets the data from the chunk servers as fast as it can digest it. The master server is not involved most of the time. The coarse-grained nature of the chunks (64MB) means a small metadata exchange covers a large amount of data: an amplification factor, the leverage that lets the system scale. &lt;br /&gt;
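&lt;br /&gt;
A sketch of that read path under simplified assumptions (the master/chunkserver calls are hypothetical stand-ins; the chunk size is the paper’s 64MB):&lt;br /&gt;
&lt;pre&gt;
# GFS-style read: one metadata lookup at the master per chunk, then the bulk
# data comes straight from a chunkserver replica; the master stays out of the
# data path.
CHUNK_SIZE = 64 * 1024 * 1024

def read(master, chunkservers, path, offset, length):
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    data = b""
    for index in range(first, last + 1):
        # master returns (chunk handle, replica locations) for this chunk index
        handle, replicas = master.lookup(path, index)
        start = max(offset, index * CHUNK_SIZE) - index * CHUNK_SIZE
        end = min(offset + length, (index + 1) * CHUNK_SIZE) - index * CHUNK_SIZE
        data += chunkservers[replicas[0]].read(handle, start, end)
    return data
&lt;/pre&gt;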
&lt;br /&gt;
Writes are more complicated. Three is the magic number (3 replicas): send the data out, wait for acks, and you might need to re-transmit until it is done. How big of a buffer do you need? A good size, several GB of RAM. In the crawling application a lot of systems are all writing to the same file; the order in which records get appended is anyone’s guess, but order doesn’t matter, which is why it is a weird kind of file.&lt;br /&gt;
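&lt;br /&gt;
A sketch of what at-least-once record append means for the application (the append call is a hypothetical stand-in; the point is the retry on the write side and the duplicate filtering on the read side):&lt;br /&gt;
&lt;pre&gt;
# At-least-once append: the client retries on failure, so the same record can
# land in the file more than once; readers must drop the duplicates themselves.
import time

def append_record(gfs_append, record_id, payload, retries=5):
    for attempt in range(retries):
        try:
            gfs_append(record_id + b"|" + payload)   # tag the record with an id
            return
        except IOError:
            time.sleep(0.1 * (attempt + 1))          # back off and try again
    raise IOError("append failed after retries")

def read_records(records):
    seen = set()
    for rec in records:
        rec_id, payload = rec.split(b"|", 1)
        if rec_id not in seen:                        # skip duplicate appends
            seen.add(rec_id)
            yield payload
&lt;/pre&gt;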
&lt;br /&gt;
Chubby: what is Chubby? A file system. A hierarchical key-value store. Can you store big files? No, about 256KB, but it is a file system. Why not use a regular file system, why Chubby? Consistency: a consistent distributed file system, where reads and writes look the same at any time and all readers and writers see the same view. The systems we saw previously were not completely consistent. Chubby needs this because it is going to be used for locks: lock files are fine on an individual system, but here you want a distributed file system that is completely consistent so you can use it for locking. Not easy, kinda crazy. They use Paxos: the same state for everyone. How many failures can you tolerate? A cell has 5 replicas because Paxos needs 2f+1 replicas to tolerate f failures, so to tolerate two failures you should have 5 machines (the arithmetic is sketched below). High level of reliability.&lt;br /&gt;
GFS when it wants to elect a master at a cell....go to chubby, find out who the master is and start talking to the master....if master is down or not responding...ask chubby, elects a new master....reliable backbone to elect who is in-charge....when master fails, switch over and transmit state to everyone quickly....chubby is how it was done. &lt;br /&gt;
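&lt;br /&gt;
The replica arithmetic, spelled out (simple majority-quorum sizing, not anything Chubby-specific):&lt;br /&gt;
&lt;pre&gt;
# With 2f+1 replicas, a majority (f+1) still exists after f failures, and any
# two majorities overlap in at least one replica, which is what Paxos needs.
def replicas_needed(f):
    return 2 * f + 1

def quorum(n):
    return n // 2 + 1

for f in range(4):
    n = replicas_needed(f)
    print("tolerate", f, "failures:", n, "replicas, quorum of", quorum(n))
# tolerate 2 failures: 5 replicas, quorum of 3
&lt;/pre&gt;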
&lt;br /&gt;
Paxos does not scale: 5 servers, fine; go to 50, no, it is a chatty protocol. What you need consistency for, you can get, but you must pay for it, so the rest of the system has to be less consistent. If you force consistency on everything all the time, you never get performance. Paxos is the algorithm Chubby uses to coordinate state in the file system. It is about managing uncertainty: everyone needs the same state. Paxos maintains a replicated data structure and agrees on updates to it; you can always ask Chubby for the current value and get the same value from any of the nodes. There might be a delay (the answer is not known yet), but when it gives an answer, it is the correct answer.&lt;br /&gt;
&lt;br /&gt;
Sequencer in Chubby: consensus on the ordering of states. Who goes first, second, third, and fourth? If you enforce ordering, you pay a price. You can do this with a lock generation number. Ordering, temporal ordering, is more difficult than consensus on state. &lt;br /&gt;
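&lt;br /&gt;
A sketch of the lock-generation-number idea (illustrative, not Chubby’s actual sequencer API): servers remember the newest generation they have seen and reject requests carrying an older one, so a stale lock holder cannot clobber newer work.&lt;br /&gt;
&lt;pre&gt;
# Requests carry the lock generation under which they were issued; anything
# older than the newest generation seen so far is rejected.
class ResourceServer:
    def __init__(self):
        self.latest_generation = 0
        self.value = None

    def write(self, generation, value):
        if generation &lt; self.latest_generation:
            raise PermissionError("stale lock generation, request rejected")
        self.latest_generation = generation
        self.value = value

server = ResourceServer()
server.write(generation=7, value="from the current lock holder")
try:
    server.write(generation=6, value="from a delayed, stale holder")
except PermissionError as err:
    print(err)
&lt;/pre&gt;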
&lt;br /&gt;
Hadoop and ZooKeeper: the things people built after Google published these papers, a whole set of technologies, not identical to GFS but inspired by it.&lt;br /&gt;
&lt;br /&gt;
How does Hadoop compare to AWS? Amazon has its own version of this stuff. You can go to Amazon, get a bunch of VMs, and deploy Hadoop yourself, or AWS can do it for you. They also offer S3 (the cheapest storage).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;GFS&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Google doesn&#039;t use GFS anymore, use something newer&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Google needed to store lots of sequential data for their crawler and indices. Downloading everything and storing it.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Atomic record append operation, crawler download webpage, then just stick it on with record append.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;No guarantee that only written once. Application level has to verify uniqueness, makes perfect sense in the context of giant web crawler.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Enormous files with huge parts of internet.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Reflected in 64MB chunk size&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Not a lot of metadata, not a lot of files, allows you to store it in 1 metadata server. Can load it up with RAM and it will run mostly from memory.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;FAR, FAR simpler than Ceph; Ceph had many metadata servers&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;SIMPLICITY&amp;lt;/strong&amp;gt;: Need something to store results of web crawler.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;GFS optimized for batch processing, but now need fresh data.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Specialized design allows you to make more tradeoffs, high performance and simple&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;If you&#039;re doing a read for a file you ask master what chunks its stored in and master server gives you a big thing of chunks.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Optimized for appends, dont expect things to change.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Writes much more complicated, gonna be slow...something something&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Crawling application has a bunch of systems writing to a single file, order doesn&#039;t matter.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Chubby&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;GFS uses chubby for master election&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Hierarchical key value store&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Key idea is consistent distributed file system, consistent at a fine level of granularity, its gonna look the same at any time, all readers and writers will see the same view, nothing else in the course is as consistent.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Idea of Chubby is you want a completely consistent distributed file system so you can use it for locking&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Use Paxos: an entire research area aimed on distributed consensus.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;You have distributed nodes, you want the same state, how many failures can you tolerate.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Typical chubby cluster is 5 node: 2n+1, if you want to tolerate 2 failures then 2*2+1 = 5&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Paxos does not scale very well, protocol is too chatty, can not force consistency on everything all the time, will not get performance.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;amp;ldquo;Ok everyone im writing to this, wait &amp;amp;hellip; ok i&#039;m done now let&#039;s go&amp;amp;rdquo;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Chubby provides reliable backplane for saying whos in charge - just want to know who the master is. Nodes ask chubby who the master node is and if they are still alive, can become the new master this way.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Hadoop is a public implementation of the GFS and Chubby ideas, started by Yahoo, then Facebook, and now many others.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Apache zookeeper is basically like chubby.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Apache Hadoop Distributed File System&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Relation to AWS: Amazon has their own version of everything, does everything for you, AWS provides higher level infrastructure.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;S3 is a data store, very cheap, NFS or block storage like file system is far more expensive. You do not know how S3 operates.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;amp;nbsp;&amp;lt;/p&amp;gt;&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-05&amp;diff=22007</id>
		<title>DistOS 2018F 2018-11-05</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-05&amp;diff=22007"/>
		<updated>2018-11-18T18:58:28Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [http://research.google.com/archive/bigtable-osdi06.pdf Chang et al., &amp;quot;BigTable: A Distributed Storage System for Structured Data&amp;quot; (OSDI 2006)]&lt;br /&gt;
* [https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett Corbett et al., &amp;quot;Spanner: Google’s Globally-Distributed Database&amp;quot; (OSDI 2012)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
Lecture Notes:&lt;br /&gt;
&lt;br /&gt;
Lecture:&lt;br /&gt;
&lt;br /&gt;
Why did they build BigTable? Apps other than the web crawler wanted to use GFS. BigTable: they wanted it to be fast and hold lots of data; a multidimensional sorted map (not a database).&lt;br /&gt;
&lt;br /&gt;
The Ad people said that BigTable was dumb b/c they wanted to do queries to generate business reports….so they now have business reports. Every technology bears the fingerprints of the initial use cases….unix has it all over the place...weirdness everywhere.&lt;br /&gt;
&lt;br /&gt;
Here, what Google is determines the technology stack. How much code is at Google? The source code repository is in the billions of lines of code. What have they done? Not only complex apps, but the entire infrastructure for a distributed OS, services and everything, built internally mostly from scratch. They were pioneers, so they now have the legacy-code problems; open source might have something better, but they are locked in. They were at the forefront of building things at web/internet scale, built it, and know how to do it. Now there is less competitive advantage, because lots of organizations know how to build similar things, sometimes better.&lt;br /&gt;
&lt;br /&gt;
What is BigTable built on top of? GFS and Chubby. Why not build from scratch? Chubby is specialized: what it does, it does well. One thing to note when you look at the papers: the Google technologies fit together, each building on top of other technologies. There are advantages in the amount of code required to write. &lt;br /&gt;
&lt;br /&gt;
The data structure is the SSTable; the crawler stores stuff in SSTables, and BigTable was built on top to query SSTables rather than using specialized tools. How do you make it database-like? Version the SSTables, then add index semantics and locality (a mapping).&lt;br /&gt;
Request stuff, how to index and find stuff. &lt;br /&gt;
&lt;br /&gt;
Spanner was built on top of Colossus instead of GFS….key problem with GFS, batch-oriented...random access is not well optimized. Why Colossus? Don’t know. The systems being discussed are large systems, why this course is not a coding course. Big systems. What that means, papers published about big systems, things were glossed over. Don’t talk about everything, have to leave a lot out. Paper is saying how to solve the problem. How to implement SQL transactions on a global scale. How to implement an entire system, they don’t go through that. &lt;br /&gt;
&lt;br /&gt;
Memtable (5.3): what is it? It works like a log-structured file system. Writes never overwrite what was written before; you might have 5 versions of a record, one current and the rest old. Whatever is in memory gets dumped out periodically, and with a log structure you just keep appending new versions and clean up the old stuff later. There is a compaction step: take the current stuff, write it out fresh, and reclaim the rest. This is an important pattern to watch for in these systems.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
How does a normal file system work? Have blocks &lt;br /&gt;
Hard drives (mechanical): sequential reads and writes are fast, while random access, which moves the head between tracks, is slow. As density increases, drives get faster at the same RPM. &lt;br /&gt;
&lt;br /&gt;
How to speed this up? Keep data in RAM and periodically write to disk. Good for performance until you have a problem, like the system losing power. The big risk is not just losing some recent data but corrupting the on-disk data structures; if they are corrupted you can lose all the data on disk. fsck on Unix, chkdsk on Windows: all they do is go through and check whether the file system is clean. If it is in a dirty state, they walk the file system and ensure consistency, checking metadata blocks. The time this takes on modern drives is enormous if you have to check the whole drive; on RAID arrays it takes so long you simply can’t do this. This is bad.&lt;br /&gt;
&lt;br /&gt;
                   &lt;br /&gt;
So, the idea of the journaled FS: set aside a portion of the disk where you do all your writes first, written sequentially, so it is fast. Once committed, if you lose power, all you have to do is replay the journal. Great system when the workload is dominated by reads, but if you have a lot of writes this is bad, because you have doubled every write. What is the solution? Eliminate everything except the journal and just write records one after the other. A given block number might be on disk 50 times because multiple versions have been written, but you go back later and reclaim the old ones.&lt;br /&gt;
                   &lt;br /&gt;
So the longer you go without cleanup, the faster the writes, because each new record simply supersedes the old one. If you have 20 copies of A and one version of B, cleanup keeps 1 copy of A and 1 of B and throws the rest out. BigTable does almost the same thing: it keeps writing new data while the old data might still be valid. This is the classic strategy when you have lots of writes and want to avoid random access, using sequential writes instead.&lt;br /&gt;
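&lt;br /&gt;
A toy version of that append-then-compact pattern (a simplification of the memtable/SSTable idea, not BigTable’s actual code):&lt;br /&gt;
&lt;pre&gt;
# Append-only log with periodic compaction: writes never overwrite in place,
# and compaction keeps only the newest version of each key.
log = []                        # stand-in for the on-disk sequential log

def put(key, value):
    log.append((key, value))    # always append, never overwrite

def get(key):
    for k, v in reversed(log):  # the newest entry for the key wins
        if k == key:
            return v
    return None

def compact():
    latest = {}
    for k, v in log:            # the last write for each key survives
        latest[k] = v
    log[:] = list(latest.items())

put("A", 1); put("A", 2); put("B", 9); put("A", 3)
print(get("A"), len(log))       # 3, with 4 log entries before compaction
compact()
print(get("A"), len(log))       # 3, with 2 log entries after compaction
&lt;/pre&gt;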
&lt;br /&gt;
So Spanner: they wanted a database with transactions. What is the hard part of a transaction? Having a consistent state. You have to have the same view globally, and globally literally means the globe. You want a consistent view of the data: one place has the seats, another place has the payments, and if they are not consistent you are selling seats you don’t have. How does time help you solve that? It creates a concept of before and after: what is the global order of events? That is what you need for global consistency. In a distributed system, ordering is not consistent by default, so maintaining a global ordering is normally hopeless: one node sees A happen before B and another sees B before A. If the ordering is inconsistent you get different versions of the data with no way to say which came first. This is bad. By having synchronized clocks, and knowing the skew between them, Spanner can assign a timestamp to each event and compare what happened first. If there is any possibility that the ordering is ambiguous, they know it, and they wait out the window of uncertainty until things settle down before doing the operation. You get a global ordering of events by using accurate clocks. From a physics standpoint this is bizarre: under relativity, simultaneity is undefined and depends on the frame of reference. GPS: atomic clocks in space, essentially sending radio signals.&lt;br /&gt;
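&lt;br /&gt;
A sketch of the commit-wait idea, assuming a TrueTime-style call that returns an uncertainty interval (the names and the epsilon below are made up for illustration):&lt;br /&gt;
&lt;pre&gt;
# Commit wait: pick a timestamp at the latest edge of the clock's uncertainty,
# then wait until the clock is certainly past it before exposing the commit.
import time

EPSILON = 0.004  # assumed clock uncertainty, in seconds

def now_interval():
    t = time.time()
    return (t - EPSILON, t + EPSILON)   # (earliest, latest) bounds on true time

def commit(apply_writes):
    commit_ts = now_interval()[1]       # timestamp = latest possible "now"
    apply_writes(commit_ts)
    while now_interval()[0] &lt;= commit_ts:
        time.sleep(EPSILON / 2)         # wait out the window of uncertainty
    return commit_ts                    # safe to report the commit order now
&lt;/pre&gt;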
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Corporate view of BigTable, why did they build it? WebCrawling at first, then other applications used it. Need a system to hold a lot of data.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;BigTable not a DB, its a sparse, persistent, distributed map.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Ad people didn&#039;t want BigTable, they wanted queries to generate reports, that&#039;s what SQL is used for often.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Every technology always bears the fingerprints of its initial uses.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Google has millions of lines of code, manage extremely complicated systems. Google has a problem with complexity. It was building things at the forefront of the web.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Generally a single organization figures out how to build something, it&#039;s messy, then others implement it better.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Google technologies fit together.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;GFS is fundamentally batch oriented and streams data, not optimized for random reads and writes, perhaps that&#039;s why it was succeeded by Colossus.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;These systems are MASSIVE!&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Things get complicated when you write it, &amp;lt;/span&amp;gt;&amp;lt;strong&amp;gt;MODIFY IT&amp;lt;/strong&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;, then have to query it again.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Memtable: looks kind of like a log structured file system. You never overwrite what you wrote previously, you always supersede it, newest version in RAM, older on disk. Later clean up the old ones.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Normal file system has blocks, some of these blocks are inodes, others will be data, inode will have pointers to data. Another block will be a directory which is basically a file that has a list of names to inodes and the inode points to actual data. Whenever you have to update a file you have to write everywhere, bouncing all over the place and this is slow on a spinning hard drive. Hard drive optimized for sequential writes and reads. Speed this up by caching in memory then periodically write. All good for performance until you lose power! Most recent stuff in memory is lost, not a big deal, but what if you corrupt data structure?? Use &amp;lt;/span&amp;gt;&amp;lt;strong&amp;gt;fsck &amp;lt;/strong&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;to fix this but &amp;lt;/span&amp;gt;&amp;lt;strong&amp;gt;very slow&amp;lt;/strong&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;! Fsck times on modern drives is enormous, in raid arrays it is hours or days - you just cant do this! People came up with Journaled file system: Set aside some portion of the disk which is the journal - this is where you do all your writes first, &amp;lt;/span&amp;gt;&amp;lt;strong&amp;gt;sequentially&amp;lt;/strong&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;, thus you have to write twice to disk, but since its sequential its fast. If you lose power all you have to do is replay journal. Great system if workload dominated by reads, but if you have a higher proportion of writes then this is bad, as you double every write. SOLUTION: eliminate the block part and make the entire file system a journal, this makes it wasteful for space but you can go back and clean it up. The longer you go without cleanup the longer this cleanup process takes. Thus you periodically do a compaction phase. Basically you throw away older versions and keep most current. BigTable basically does this.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Spanner is a DATABASE, has TRANSACTIONS, transactions difficult because Spanner is global and you want a consistent view of your data. Need much better synchronization.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;TrueTime creates the concept of before/after, need to know global order of events. In a distributed system ordering is not consistent by default, because of replication delays and so on.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;If you want global consistency then you&#039;re basically screwed because it will always be slightly off. So by having synchronized clocks and measuring the skew between them you can go back later and definitely conclude the order.&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;&amp;lt;span style=&amp;quot;font-weight: 400;&amp;quot;&amp;gt;Get global consistency through accurate time &amp;amp;hellip; which kind of goes against physics (relativity), since simultaneous events are undefined depending on the frame of reference. All the systems are on the same planet so it works out!&amp;lt;/span&amp;gt;&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-07&amp;diff=22006</id>
		<title>DistOS 2018F 2018-11-07</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-07&amp;diff=22006"/>
		<updated>2018-11-18T18:57:22Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [http://research.google.com/archive/mapreduce.html Dean &amp;amp; Ghemawat, &amp;quot;MapReduce: Simplified Data Processing on Large Clusters&amp;quot; (OSDI 2004)]&lt;br /&gt;
* [https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi Martin Abadi et al., &amp;quot;TensorFlow: A System for Large-Scale Machine Learning&amp;quot; (OSDI 2016)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
In-Class Lecture notes:&lt;br /&gt;
&lt;br /&gt;
MapReduce:&lt;br /&gt;
&lt;br /&gt;
Parallel computations&lt;br /&gt;
Some things are inherently serial so, with computation, some things will take the same time regardless of cores or processors.&lt;br /&gt;
Parallelization is hard: you need some amount of coordination between the systems to allow them to work together.&lt;br /&gt;
Among systems that scale to large numbers of cores and machines, the ones that minimize coordination succeed. &lt;br /&gt;
For instance, GFS, they did a lot of things to reduce coordination to allow it to scale....does cause a bit of a mess but to clean up the mess would slow things down.&lt;br /&gt;
We want models of computation that take advantage of these systems; MapReduce is the cleanest example. What sort of analysis do they talk about doing? Indexing, counting across things, grep: a search engine is just a big grep. &lt;br /&gt;
Why not run grep individually on all the computers, why do you need the framework?&lt;br /&gt;
Coordinating and consolidating the results from the machines. All MapReduce is: computations done on pieces of data, then combined together.&lt;br /&gt;
The functional programming aspect: stateless, don’t maintain state. The values of variables do not change (only bindings can change). In a parallel system the cost is in coordinating shared state; if there is no state, you don’t need to coordinate. Map and Reduce: if they were stateful, you might have to undo computations when they mess up because of side effects, and you could not run the code over and over again on the same data. If it is purely functional, the answer will be the same no matter how many times it has run. Fault tolerance: duplicate work to make sure the overall computation finishes on time, and rerun tasks knowing they give the same answers. The hard parts of fault tolerance show up when you are not doing functional programming. With 10,000 machines running a job, computers can fail during the computation and you don’t care, because there is no state to preserve; the master gathers the pieces together and the combining (reduce) function produces the results. When you do a search, everything you get has been pre-computed.&lt;br /&gt;
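&lt;br /&gt;
A minimal word-count illustration of the map/shuffle/reduce shape (pure functions, no shared state, so any piece can be rerun after a failure):&lt;br /&gt;
&lt;pre&gt;
# Word count, MapReduce-shaped: map emits (word, 1) pairs, the shuffle groups
# them by key, and reduce sums each group.
from collections import defaultdict

def map_fn(document):
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    return key, sum(values)

docs = ["the web is big", "the web is the web"]
pairs = [p for d in docs for p in map_fn(d)]                   # map phase
counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["web"])                            # 3 3
&lt;/pre&gt;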
&lt;br /&gt;
MapReduce is a somewhat ad-hoc model that is not used much anymore due to fundamental limitations, but it is the right kind of paradigm. The limitation: the problem has to fit into map and reduce; there is only coordination on the reduce side, none on the map side. TensorFlow is not embarrassingly parallel at all; you have to worry about interactions. What is the model for breaking the work up? The unit is a tensor, a multidimensional array, and a lot of mathematics is defined on multidimensional arrays. TensorFlow breaks the computation up into operations on tensors. It is functional programming in a sense, but in a different context: if you can define the math at a high level, the underlying system can refactor it. How it fits together is not our problem; the work is divided up into tensors, math ops on tensors, and parallelism.&lt;br /&gt;
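&lt;br /&gt;
A toy illustration of why that helps: an elementwise tensor op can be split into shards, computed independently (potentially on different workers), and recombined purely by the math; plain Python lists stand in for real tensors here.&lt;br /&gt;
&lt;pre&gt;
# Shard a vector, apply the op to each shard independently, concatenate back.
def shard(vec, parts):
    size = (len(vec) + parts - 1) // parts
    return [vec[i:i + size] for i in range(0, len(vec), size)]

def scale_shard(s, factor):                       # elementwise op on one shard
    return [factor * x for x in s]

vec = list(range(10))
shards = shard(vec, 3)
scaled = [scale_shard(s, 2.0) for s in shards]    # each shard is independent
result = [x for s in scaled for x in s]           # stitch the shards back
assert result == [2.0 * x for x in vec]
print(result)
&lt;/pre&gt;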
&lt;br /&gt;
AI is primary task for TensorFlow. Why not do other things like this? Large-scale simulations, other use case for clusters. Are maintaining large amount of state, in those systems, do partitioning based on state. &lt;br /&gt;
Game world...game parallelization. Divide the world into different places, going from one world to another could be going from one server to another. &lt;br /&gt;
&lt;br /&gt;
Why do we need insane computations for doing AI....optimizing it and AI algorithms today are very stupid. Need lots and lots of data with lots and lots of samples. The learning is not advanced the pattern abstraction is almost brute force. How many miles AI driven cars have driven...to train a single model. How many miles does a person have to drive to learn how to drive.....driving is a task based on a world model we have been building our whole lives. Self driving cars, have to teach them how the world works...don’t have a world model. Can you learn how the world works from driving a car. &lt;br /&gt;
&lt;br /&gt;
Check-pointing in MapReduce: why not? Failures that happen during the computation don’t matter, just restart and redo that part. But TensorFlow cares a lot more about reliability. &lt;br /&gt;
How long does a MapReduce job run? Not long, a few minutes to an hour tops; parallelism has made everything quick. What kinds of neural nets is TensorFlow training? Facial recognition, recommendations, translation: models running for days or weeks with TBs of input, with lots of state being created in the neural networks along the way, so they implemented check-pointing. What do you care about saving? Saving all the state is expensive, so save periodically (say every hour) and keep the results that are good, when the neural network is doing well. &lt;br /&gt;
&lt;br /&gt;
Genetic algorithms...the next wave in AI. &lt;br /&gt;
&lt;br /&gt;
Fitness function: some candidate solutions are garbage, some have good fitness; you combine the good ones together (using an operation called crossover), then mutate by flipping some bits to produce new solutions, and do it again, recomputing fitness each round. It is an abstraction of natural evolution.&lt;br /&gt;
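&lt;br /&gt;
A tiny genetic-algorithm loop to make the cycle concrete (a made-up fitness function that just counts 1-bits; not tied to any particular system):&lt;br /&gt;
&lt;pre&gt;
# Evaluate fitness, keep the best, recombine with crossover, mutate, repeat.
import random

def fitness(bits):
    return sum(bits)                     # toy fitness: count the 1s

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - b if random.random() &lt; rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]            # the fittest survive
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children
print(fitness(max(population, key=fitness)))
&lt;/pre&gt;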
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Systems that scale well minimize coordination. Recall that lots of coordination means lots of serializability and less performance.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;in GFS we don&#039;t know how many times appended&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;MapReduce is for &amp;lt;strong&amp;gt;embarrassingly parallel architectures&amp;lt;/strong&amp;gt;.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;We are interested in MODELS OF COMPUTATION that take advantage of massive parallelization.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;MapReduce is for indexing, for counting. Search engine is like a big grep, but why not use Grep? Because you need coordination to take advantage of thousands of computers.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Functional programming is &amp;lt;strong&amp;gt;stateless&amp;lt;/strong&amp;gt;. If x=5, then x always equals 5, you can only create a new x. We always pay for coordinating state.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Because MapReduce has no state you don&#039;t have to worry about side effects, if a machine flakes just do it on another, no state to manage. &amp;lt;strong&amp;gt;Inherent fault tolerance&amp;lt;/strong&amp;gt;!&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Larger scale programming will probably be done with functional programming. MapReduce allows you to run a job on thousands of computers.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Count the number of times Argentina occurs in a lot of pages; voila, it appears 50 million times. But when you query Google you are not starting a MapReduce job, everything has been precomputed.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;MapReduce has fundamental limitations, the problem has to fit into the MapReduce paradigm. Can&#039;t do computations that need coordination between map tasks.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;TensorFlow allows you to do this, TensorFlow is NOT for embarrassingly parallel computation at all.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Fundamental data in TensorFlow is a multidimensional array, matrix is a 2d array, tensor is multidimensional.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Tensors can be enormous, so TensorFlow will break up the tensor and combine them together, can do this through mathematical equivalences.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;A lot of AI can be represented as Tensor operations. Large scale simulations do not map nicely to Tensors. In large scale computations you need state. Need some partitioning based on space, ex. In weather simulation you divide into cubes of 1km or something like that. Same thing in a game world how you go from 1 server to another when you change place.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;AI algorithms today are stupid and require lots of training, learning is not sophisticated. Think about how many miles AI cars have driven. People only had to drive a few. Humans have been building a world model since children, we are just slightly adding to it with a car. You can not learn the way the world works by driving a car. Toddler learns physics as they grow up. You need a lot of data and computation.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;No checkpoint in MapReduce since you can just redo it and jobs are short.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;In TensorFlow you require checkpoints. Models in TensorFlow run for days, lots of state being created. Checkpoints implemented, under control of application, dont want to save too often, just keep the neural nets that do well.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Anil believes that future of AI is in Genetic algorithms.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Neural networks:&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;In neural networks you have a lot of weights, and you backpropagate to adjust weights.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;DeepLearning is with lots of layers.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Need to encode data into the input layers.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;A genetic algorithm starts with a random population and a fitness function. Successful ones get combined together with mutation applied, then you further refine.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Most of AI is searching through space.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;amp;nbsp;&amp;lt;/p&amp;gt;&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-12&amp;diff=22005</id>
		<title>DistOS 2018F 2018-11-12</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-12&amp;diff=22005"/>
		<updated>2018-11-18T18:56:04Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf DeCandia et al., &amp;quot;Dynamo: Amazon’s Highly Available Key-value Store&amp;quot; (SOSP 2007)]&lt;br /&gt;
* [http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf Lakshman &amp;amp; Malik, &amp;quot;Cassandra - A Decentralized Structured Storage System&amp;quot; (LADIS 2009)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
Lecture: &lt;br /&gt;
&lt;br /&gt;
Dynamo:&lt;br /&gt;
&lt;br /&gt;
What is Dynamo the solution too? Shopping carts and web sessions.&lt;br /&gt;
&lt;br /&gt;
When you put something in a shopping cart, it stays, even months later. Shopping carts matter, but the priorities are weird compared to past systems: the first concern is availability. If there is a session, use it; if not, hand out a new session. And this matters most at times like Christmas. &lt;br /&gt;
&lt;br /&gt;
This does not look like how Google builds things, what is different?&lt;br /&gt;
What does big table depend on? Gfs and chubby. &lt;br /&gt;
Dynamo is a standalone service that solves the problem. Why doesn’t Google do that? It is about philosophy. The two-pizza rule: teams should be no bigger than can be fed with two pizzas. A bunch of small teams, and they want each team to be autonomous, with no deep management hierarchy. Relatively small teams need to be able to work on a self-contained project that does not tightly couple them to another team: service-oriented architecture. When other folks inside Amazon use your service, they use an API that could just as well be exposed to the world.&lt;br /&gt;
Contrast with Google, whose systems are designed to run in a trusted environment that is not exposed to the world. Denial-of-service attacks have to be considered from the beginning at Amazon more than at Google. Amazon tries to build a platform; Google tries to provide a product. &lt;br /&gt;
Microsoft does this kind of, internal APIs.  &lt;br /&gt;
AWS, this is the API we use, don’t get a different one. If you don’t like Amazon’s AWS, go do your own thing.&lt;br /&gt;
Number of services offered by AWS is insane and redundancy among them with overlapping capabilities. Kind of Darwinian. AWS webpage is a thin layer over internal APIs...org chart...look at a product, what they are selling is a reflection of the way the organization is organized. Amazon’s is very flat, Google is not. Google Hierarchical. Not that one is better than the other, just different. Amazon is winning the platform race. &lt;br /&gt;
&lt;br /&gt;
Facebook Cassandra:&lt;br /&gt;
&lt;br /&gt;
Inbox search: you have messages and want to search them. Why is this a funny thing for Facebook to develop? Google is based on searching; Facebook is not, it is the social graph and the Facebook wall, and now they need to let people search. What is the system optimized for? Writes. They want inbox search running consistently, and messages are written far more often than they are queried, so it is optimized for writes, not for queries: mostly data dumps, occasional searching. The output is almost log-structured, and everything is pulled into memory for indexing and searching. Someone’s inbox can probably fit on a node; this is not a big-data problem in the sense of a Google search. Limited search: load the inbox into memory and do a quick search of it there, which is very different. Why does that matter? They are solving a specialized problem: no need for a relational database, no schema, free-form search. &lt;br /&gt;
&lt;br /&gt;
What is similar in how they are implemented? Do these use Paxos? Cassandra uses a gossip protocol over a ring of nodes (Scuttlebutt) and elects a leader; the leader mainly decides which ranges the replicas are responsible for. The problem is partitioning, and this is where consistent hashing comes in: as the ring grows or shrinks you don’t have to rehash everything, otherwise you would have to reorganize all the data; with consistent hashing only a fraction of the keys move. You can see the same ideas getting applied again and again. Gossip protocols we have not seen before: Paxos uses a small number of nodes that maintain a consistent view of state, replicated in a hierarchy, whereas it is harder to get strict guarantees out of a gossip-style protocol, but you get better performance for handling incoming data rather than consistency. Cassandra’s consistency model can be tuned. Ring structures and gossip protocols go together: you talk to your neighbours on the ring, plus a few more, so communication is redundant but not too redundant.&lt;br /&gt;
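&lt;br /&gt;
A small consistent-hashing sketch showing why only a fraction of keys move when a node joins (illustrative only, one token per node, not Cassandra’s implementation):&lt;br /&gt;
&lt;pre&gt;
# Nodes and keys hash onto the same ring; a key belongs to the first node at or
# after its hash.  Adding a node only claims the keys between it and its
# predecessor, so most keys stay where they were.
import bisect, hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]   # wrap around the ring

    def add(self, node):
        bisect.insort(self.points, (h(node), node))

keys = ["user%d" % i for i in range(1000)]
ring = Ring(["n1", "n2", "n3"])
before = {k: ring.node_for(k) for k in keys}
ring.add("n4")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
print(moved, "of", len(keys), "keys moved")   # only a fraction, not all
&lt;/pre&gt;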
&lt;br /&gt;
The difference between inbox search and the shopping cart: Dynamo is a key-value store, no search. Cassandra needs to do search on a node: it is optimized for writes, but when you do a search the inbox is pulled into a node’s memory for searching. It is not doing a lot of indexing, just metadata for where messages are stored. If searches are relatively infrequent, you might as well load the inbox into memory; how often will people search their inbox? If it is infrequent, why bother generating an index? These are specialized systems for solving specific problems, and there is a lot of room in the design space when you make specialized solutions. General solutions are either limited or broken. Compared to Ceph, this is simpler but optimized for a specific use case. In computer science we always want to make a general solution, but in practice the only way to get systems useful in many different scenarios is to use them in many different scenarios. That is why large organizations have success solving internal problems and then exporting the solutions. Google had the scale and the resources, but was not in the business of selling it (internal use only); Amazon did the opposite. Google is catching up to AWS, but they don’t have the right culture for it; it requires a major culture shift. &lt;br /&gt;
&lt;br /&gt;
We are seeing patterns and how the infrastructure is developing. The goal: if you see a new paper, you can tell how it fits in with the rest; if you see a design and are asked to build it, you ask what problem it is solving. You can go out there with an idea of the systems that exist and the problems they solve, rather than building your own system. If your problem doesn’t fit, maybe there is no off-the-shelf solution, but you need to be able to recognize that in terms of availability, consistency, and so on. Get away from the idea of the perfect system or best solution; ask what is right for the problem you are trying to solve, what infrastructure you have, and whether you want to use that infrastructure.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Dynamo: Solution to always available shopping carts.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;First care about availability. They want you to be able to add to your shopping cart no matter what. Matters because: Christmas&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;BigTable needs GFS and Chubby, Dynamo depends on nothing, it is a standalone service.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;2 pizza rule at Amazon: Amazon doesn&#039;t want teams to be any bigger than can be fed with 2 pizzas. Don&#039;t want to build big systems. Each team needs to work on a self-contained project. Amazon took on service-oriented architecture. Amazon needs to consider DoS attacks from the beginning. Amazon is trying to build a platform, Google tries to build products. AWS is the API that Amazon uses. Google makes more tightly coupled systems, Amazon loosely coupled.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Lots of redundancy between AWS systems. It is darwinian - they just put things out there.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Organizations sell their org chart - it is a reflection of how things work internally. Amazon&amp;amp;rsquo;s is far flatter, Google is more hierarchical.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Cassandra: the inbox search problem. You have a lot of messages and you want to be able to search them.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Optimized for writes, not for queries. Looks like a log-structured file system. Searching is done in memory. Much smaller records than Google: Google had the entire web, Facebook just has inboxes.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Consistent Hashing&amp;lt;/strong&amp;gt;: If you add or remove nodes (buckets), you don&#039;t have to rehash everything; only the keys in the affected ranges move.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Gossip protocols&amp;lt;/strong&amp;gt; vs Paxos: Replication with gossiping is done peer-to-peer, like an infection spreading (see the sketch after this list). With Paxos you have a small number of nodes with lots of replication.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;&amp;lt;strong&amp;gt;Ring structures&amp;lt;/strong&amp;gt; and gossip protocols tend to go together.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;In Cassandra you find your inbox first and then search within it. Dynamo is just a key-value store.&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Not a lot of indexing in Cassandra. Searches are rare, so just load the inbox into memory and do a linear search. People rarely search their inbox, so there is no need to index it the way Google does.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;These later systems are simpler, unlike Ceph which is a general solution and very complicated. General solutions that solve a problem completely rarely work.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Amazon has been very good at exporting their services. Google doesn&#039;t have the right culture to catch up to Amazon.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The big idea of this course is to see what&#039;s out there and try to reuse other people&#039;s solutions. There is no such thing as the best solution, just what&#039;s best for your particular problem.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
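&lt;br /&gt;
A rough sketch of the gossip idea referenced above (hypothetical code, not Scuttlebutt or Cassandra’s implementation): each node periodically merges state with a randomly chosen peer, so information spreads like an infection with no coordinator.&lt;br /&gt;
&lt;pre&gt;
# Rough gossip / anti-entropy sketch (hypothetical; versions are simple counters).
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}                    # key: (value, version)

    def update(self, key, value, version):
        self.state[key] = (value, version)

    def gossip_with(self, peer):
        # Merge both views, keeping the newest version of each key.
        for key in set(self.state) | set(peer.state):
            mine = self.state.get(key, (None, -1))
            theirs = peer.state.get(key, (None, -1))
            newest = max(mine, theirs, key=lambda entry: entry[1])
            self.state[key] = newest
            peer.state[key] = newest

nodes = [Node("n%d" % i) for i in range(5)]
nodes[0].update("membership", "n0..n4", version=1)
for _ in range(10):                        # a few random pairwise gossip rounds
    a, b = random.sample(nodes, 2)
    a.gossip_with(b)
print(all("membership" in n.state for n in nodes))   # very likely True after a few rounds
&lt;/pre&gt;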
&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
	<entry>
		<id>https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-14&amp;diff=22004</id>
		<title>DistOS 2018F 2018-11-14</title>
		<link rel="alternate" type="text/html" href="https://homeostasis.scs.carleton.ca/wiki/index.php?title=DistOS_2018F_2018-11-14&amp;diff=22004"/>
		<updated>2018-11-18T18:54:26Z</updated>

		<summary type="html">&lt;p&gt;Markkaganovsky: /* Notes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Readings==&lt;br /&gt;
&lt;br /&gt;
* [http://static.usenix.org/legacy/events/osdi10/tech/full_papers/Beaver.pdf Beaver et al., &amp;quot;Finding a needle in Haystack: Facebook’s photo storage&amp;quot; (OSDI 2010)]&lt;br /&gt;
* [https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar Muralidhar et al., &amp;quot;f4: Facebook&#039;s Warm BLOB Storage System&amp;quot; (OSDI 2014)]&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
Lecture Nov 14: Haystack &amp;amp; F4&lt;br /&gt;
&lt;br /&gt;
When talking about Haystack, we are really talking about CDNs. What is a CDN? A Content Distribution Network: a large-scale cache for data. The idea is that you have servers replicated around the world, close to the people and to the things people want; instead of going to the main server, you go to a local server and get a copy from there. Akamai, started by folks from MIT, pioneered the models that became the content distribution networks necessary for large-scale systems. A CDN reduces the latency of page loads. CDNs do not serve entire websites; they are bad at serving dynamic content such as email. A web app is not in a CDN because it makes no sense: no one else should be asking for the same email, so it should not be replicated. Whatever you see that is specific to you does not belong in a CDN; it only makes sense for content that is shared across many page views. When code on the client has to fetch custom data, that is where a CDN does not work. See Figure 3: requests go to the CDN and to Haystack storage.&lt;br /&gt;
&lt;br /&gt;
Photos were the problem, and the original solution was NFS. That sucked, but why? Bad performance from too many disk accesses. Why? Metadata: having to go access the inode and then the actual contents of the file was too much. For a normal file system, of course you separate the metadata from the data, since metadata has different access patterns, but that did not make sense for Facebook; they didn’t want separate reads for both. Why can you get away with merging metadata with data? The needle (Figure 5): metadata and data are intertwined. They can get away with this format because the data is immutable; the game changes in how you deal with metadata and data when they don’t change. Find the photo, and everything about it is in one place: keep track of the headers, then read everything about it in one go. You just need the offset; there is no separate inode and no pointer to one, just a big set of files and the offset of the photo within them. Fewer file operations and thus better performance. It is just realizing that your data is immutable: storing metadata and data together gives a fast access pattern. &lt;br /&gt;
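&lt;br /&gt;
A toy sketch of the needle idea (simplified; the header layout here is made up, not the real needle format): photos are appended to one big volume file, and an in-memory map from photo id to offset and size replaces per-file inode lookups.&lt;br /&gt;
&lt;pre&gt;
# Toy Haystack-style volume (simplified; the real needle has more fields, checksums, flags).
import struct

class Volume:
    HEADER = struct.Struct("QI")      # photo_id (8 bytes) + size (4 bytes), made-up layout

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}               # in-memory map: photo_id to (offset, size)

    def put(self, photo_id, data):
        self.f.seek(0, 2)             # append at the end of the volume file
        offset = self.f.tell()
        self.f.write(self.HEADER.pack(photo_id, len(data)) + data)
        self.f.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # One seek, one read: metadata and data live together, no inode walk.
        offset, size = self.index[photo_id]
        self.f.seek(offset + self.HEADER.size)
        return self.f.read(size)

vol = Volume("photos.vol")
vol.put(42, b"...jpeg bytes...")
print(vol.get(42))
&lt;/pre&gt;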
&lt;br /&gt;
From a photo’s name you get where it lives, which gets you speed. The data has to be immutable for this to work, and fast is not good if it is not durable: we need protection and redundancy, so don’t store each photo once, store it multiple times. The indexes are kept in memory.&lt;br /&gt;
&lt;br /&gt;
F4 is built on HDFS (Hadoop’s distributed file system) and Haystack uses XFS.&lt;br /&gt;
&lt;br /&gt;
You only have to touch the disk once to read a photo, which makes it fast; at this scale they are not using solid state disks, way too expensive. Haystack is great for serving photos, so what are the problems with it? Replication factors. The reason for the fractional numbers is that in practice you have failures, so you get fractions: photos are replicated between 3 and 4 times, which works out to about 3.6, so that on average there are at least three copies. That is too expensive: everyone’s photos stored 3 to 4 times each. If you can remove even one replica, you save a lot on storage costs. Between 2010 and 2014 people also started paying attention to Facebook, specifically to privacy. A difference between Haystack and f4 is that f4 deletes quickly, while Haystack only marked items for deletion. Cheaper storage, better deletes. What is the trick, how did they do it? For storage, the same thing as is used in RAID5: parity, striping the data so that fewer full copies are needed. For deletes, encrypt everything: every photo has an encryption key stored in a separate database, and if you delete the encryption key, you delete the data. Modern systems replicate data everywhere to guard against failures (logs, journals, copies on top of copies at every scale), so if you encrypt and then delete the key, everything is gone at once. Haystack becomes the photo cache, for photos being accessed frequently, and f4 is used for warm storage. Why not use f4 for everything? The parity scheme makes it slower, and it has fewer replicas to read from; with multiple replicas you can read from them in parallel. So Haystack is good for hot content while f4 is better for content that is colder but not completely cold. &lt;br /&gt;
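&lt;br /&gt;
The parity trick in its simplest form (a single XOR parity block, as in RAID5; f4 itself uses a more elaborate erasure code, so this is only the idea): store the parity of a stripe instead of another full copy, and any one lost block can be rebuilt from the survivors.&lt;br /&gt;
&lt;pre&gt;
# RAID5-style single-parity sketch (illustration of the idea only).
def xor_blocks(blocks):
    # XOR equal-length byte strings together.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

stripe = [b"AAAA", b"BBBB", b"CCCC"]       # data blocks in one stripe
parity = xor_blocks(stripe)                # stored instead of another full replica

lost = 1                                   # pretend block 1 sits on a failed disk
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == stripe[lost])             # True: the lost block is recovered
&lt;/pre&gt;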
&lt;br /&gt;
Amazon Glacier: cheap storage, really cheap, but you cannot access it quickly; getting data from Glacier into S3 takes hours. It is not online, it might be on tapes sitting offline, so it is not good for data that needs immediate access. Going from f4 back to Haystack takes longer than a cache hit, but not that long, a couple of seconds.  &lt;br /&gt;
&lt;br /&gt;
Cold storage is traditionally about disaster recovery, but that is not useful for online services. &lt;br /&gt;
Haystack’s replication gives a durability guarantee but also a performance benefit. It is remarkable how much engineering goes into these seemingly trivial uses.&lt;br /&gt;
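&lt;br /&gt;
The delete-by-deleting-the-key idea from the f4 discussion above, as a sketch (the helper names are made up, and the toy cipher stands in for a real one such as AES): the ciphertext can be replicated and backed up anywhere, but once the per-photo key is dropped from the key database every copy becomes unreadable.&lt;br /&gt;
&lt;pre&gt;
# Toy crypto-erase sketch (standard library only; the stream cipher below is NOT for real use).
import os, hashlib

def xor_stream(key, data):
    # Derive a keystream from the key and XOR it with the data (toy construction).
    blocks = []
    for counter in range((len(data) + 31) // 32):
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
    stream = b"".join(blocks)[:len(data)]
    return bytes(a ^ b for a, b in zip(data, stream))

key_db = {}        # per-photo encryption keys, kept in a separate database
photo_store = {}   # ciphertext, free to replicate into caches, logs, backups

def store_photo(photo_id, data):
    key_db[photo_id] = os.urandom(32)
    photo_store[photo_id] = xor_stream(key_db[photo_id], data)

def read_photo(photo_id):
    return xor_stream(key_db[photo_id], photo_store[photo_id])

def delete_photo(photo_id):
    # Deleting just the key makes every replica of the ciphertext unreadable.
    del key_db[photo_id]

store_photo(7, b"cat picture bytes")
print(read_photo(7))
delete_photo(7)   # the ciphertext may still exist everywhere, but it is now garbage
&lt;/pre&gt;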
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== More notes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;CDN&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;CDN stands for Content Distribution Network, essentially a large-scale cache. The idea is that you have servers replicated around the world, close to people, so that instead of going to far-away servers you fetch from a nearby copy.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Akamai is a big name in CDNs; its developers were physicists at MIT who used fluid mechanics to model people&#039;s usage and data flow.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;A CDN is necessary for large-scale systems.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;CDNs do not serve entire websites. They are bad at serving dynamic content: email is not in a CDN, it makes no sense there. Whenever what you see is specific to you, a CDN does not help; it only makes sense for highly replicated data, not personalized data.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Browsers go to the CDN for some data, but also go to Haystack.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;CDNs normally rewrite the URL.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;AMP is a CDN for websites: put your content in this format and Google will help you deliver it faster and also provide analytics.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Haystack&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;NFS was used initially, but there were too many disk accesses because of metadata. Separating metadata makes sense in classic file systems but not for Facebook.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The needle format merges metadata with data; they can get away with this because the data is immutable.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;A bunch of photos are combined into a single file, which is then stored on a classic file system.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The name of a photo is just an offset into a big file where you fetch both metadata and data, instead of going through the inode structure.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Fast is not enough; they also have to replicate it.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Haystack uses XFS and does its own replication; f4 is built on HDFS.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The replication factor is never exactly 3 because of servers coming back and other issues; this is why the observed factor is about 3.6.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;This is too high a replication factor.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;f4&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;People started caring about Facebook at this time, in particular about security and deletes.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;f4 enables deletes: items marked for deletion are eventually reclaimed. Cheaper storage, better deletes.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Facebook encrypts everything; every photo has its own encryption key, stored in another database. Deleting the key deletes the photo.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The only way to delete things quickly in modern systems is to have everything encrypted and delete the key.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Haystack is for hot storage, f4 is for warm storage. Haystack in effect becomes the photo CDN.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Why not use f4 for everything?&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;f4 is inherently slower because of the parity scheme.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;The best thing about the multiple replicas in Haystack is that they are faster to read from.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Cold storage: data which is not accessed often and takes time to retrieve, e.g. Amazon Glacier. Glacier is really cheap, but you can&#039;t access it immediately.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;Facebook doesn&#039;t really have cold storage.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;/div&gt;</summary>
		<author><name>Markkaganovsky</name></author>
	</entry>
</feed>