DistOS 2018F 2018-11-14
Readings
- Beaver et al., "Finding a needle in Haystack: Facebook’s photo storage" (OSDI 2010)
- Muralidhar et al., "f4: Facebook's Warm BLOB Storage System" (OSDI 2014)
Notes
Lecture Nov 14: Haystack & f4
When talking about Haystack, we are also talking about CDNs. What is a CDN? A content distribution network: a large-scale cache for data. The idea is that you have servers replicated around the world, close to the people and to the things people want, so instead of going to the main server you go to a local server and get a copy from there. Akamai, founded by folks from MIT, pioneered the model, and content distribution networks became necessary for large-scale systems because they reduce the latency of page loads. CDNs do not serve entire websites; they are bad at serving dynamic content such as email. A web app is not in a CDN because it makes no sense: no one else should be asking for the same email, so it should not be replicated, since what you see is specific to you. Caching only makes sense if the content is shared across multiple page views. Code on the client is going to have to fetch its custom data, and that is where a CDN does not work. See Figure 3: requests go to the CDN and to Haystack storage.
Photos were the problem. The original solution was NFS, and this sucks, but why? Bad performance: too many disk accesses. Why? Metadata. Having to access the inode and then the actual contents of the file was too much. For a normal file system, of course you separate the metadata from the data, since metadata has different access patterns, but that did not make sense for Facebook; they did not want separate reads for both.
Why can you get away with merging metadata with data? The needle (Figure 5): metadata and data are intertwined. Why can they get away with that format? Because the data is immutable. The game changes in how you deal with metadata and data when they never change. Where is the photo? Find the photo, and everything about it is in one place: keep track of the headers, then read everything about it in one go. You just need the offset. There is no separate inode and no pointer to an inode, just one big file plus the offset of the photo within it. That reduces file operations and thus increases performance. It is just realizing that your data is immutable: storing metadata and data together gives a fast access pattern (a toy version of the layout is sketched below).
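To make the needle idea concrete, here is a toy version of the record layout; the field names and sizes are illustrative stand-ins, not the real on-disk format from Figure 5. The point is only that the header (metadata) and the photo bytes sit back to back, so one sequential read returns both:

```python
import struct

# Hypothetical needle record: a fixed metadata header packed immediately
# before the photo bytes, so a single contiguous read yields both.
HEADER = struct.Struct("<QQBI")  # key, cookie, flags (e.g. deleted), size

def pack_needle(key, cookie, data, flags=0):
    """Serialize one needle: header immediately followed by the data."""
    return HEADER.pack(key, cookie, flags, len(data)) + data

def unpack_needle(buf):
    """Decode a needle from one contiguous byte range."""
    key, cookie, flags, size = HEADER.unpack_from(buf)
    data = buf[HEADER.size:HEADER.size + size]
    return {"key": key, "cookie": cookie, "flags": flags, "data": data}

blob = pack_needle(key=42, cookie=7, data=b"...jpeg bytes...")
assert unpack_needle(blob)["data"] == b"...jpeg bytes..."
```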
Having the photo's name gets you to the data directly, and that gets you speed; it only works because the data is immutable. But fast is not good if it is not durable. We need protection and redundancy: don't store every photo once, store it multiple times. The indexes are kept in memory (see the lookup sketch below).
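A minimal sketch of that read path, assuming hypothetical names (index, read_photo): the index maps a photo ID straight to an offset in a big volume file, so serving a photo costs one in-memory lookup plus one disk read, with no inode fetch in the hot path:

```python
import os

# Hypothetical in-memory index: photo_id -> (offset, size) inside one
# large append-only volume file; rebuilt into RAM when a machine starts.
index = {}

def read_photo(volume_fd, photo_id):
    offset, size = index[photo_id]            # memory only, no disk I/O
    return os.pread(volume_fd, size, offset)  # one read: header + photo
```

Durability is orthogonal to this fast path: each needle is also written to volume files on several other machines, which is where the three-to-four copies above come from.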
f4 is built on top of Hadoop (it uses HDFS underneath), while Haystack stores its volumes on XFS.
Haystack only has to touch the disk once to read a photo, which makes it fast. At this scale they are not using solid-state disks; way too expensive. Haystack is great for serving photos, so what are the problems with it? The replication factor. The reason for the fractional numbers: you count raw bytes per logical byte, and Haystack's effective replication factor works out to 3.6 (three geographic copies, each with RAID-6 overhead), so every photo costs between three and four copies, on average at least three. That is too expensive; if you can remove one replica, you save hugely on storage costs. Also, between 2010 and 2014 people started paying attention to Facebook, specifically to privacy.
The difference between Haystack and f4: f4 deletes quickly, while Haystack only marks items for deletion. So f4 gives cheaper storage and better deletes. What is the trick, how did they do it? For storage, the same idea as RAID5: parity blocks that let you rebuild erased data (erasure coding), so you stripe the data across machines and keep parity instead of whole extra copies (see the parity sketch below).
For deletes: encrypt everything. Every photo has an encryption key stored in a separate database, and if you delete the encryption key, you delete the data. This matters because modern systems replicate data everywhere to guard against failures: logs, journals, copies on top of copies on top of copies at every scale. If the data is encrypted and you delete the key, everything is gone (see the crypto-delete sketch below).
Haystack then becomes the photo cache, serving the photos being accessed hot. Warm storage is what f4 is for. Why not use f4 for everything? The parity machinery costs on reads, and f4 has fewer replicas to read from; with multiple full replicas you can read from them in parallel. So Haystack is good for hot content while f4 is better for the colder but not completely cold.
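Here is the parity idea in miniature, using RAID5-style single parity with XOR; f4 itself uses Reed-Solomon erasure coding, which tolerates more simultaneous losses, but the principle is the same: store a little parity instead of a whole extra copy and rebuild lost blocks from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]  # three data blocks on three machines
parity = xor_blocks(data)           # one parity block, not three full copies

# The machine holding data[1] dies: XOR the surviving blocks with the
# parity block to reconstruct what was lost.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"BBBB"
```

This is exactly why the effective replication factor drops: parity adds a fraction of a copy rather than whole additional replicas.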
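And a sketch of delete-by-key-deletion (sometimes called crypto-shredding); the per-photo key table and the cryptography library here are stand-ins, since the lecture does not go into f4's actual key management, but the property is the point: once the small key is gone, every replica, backup, and log copy of the ciphertext becomes unreadable.

```python
from cryptography.fernet import Fernet  # pip install cryptography

keys = {}  # photo_id -> key: the separate key database from the lecture

def store(photo_id, data):
    """Encrypt with a fresh per-photo key; the ciphertext can then be
    replicated anywhere without tracking where all the copies went."""
    keys[photo_id] = Fernet.generate_key()
    return Fernet(keys[photo_id]).encrypt(data)

def delete(photo_id):
    """Deleting one small key logically deletes every copy of the data."""
    del keys[photo_id]

ciphertext = store("p1", b"...jpeg bytes...")
delete("p1")  # every replica of the ciphertext is now undecryptable
```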
Amazon Glacier: really cheap storage, but you cannot access it quickly; a restore from Glacier to S3 takes hours. It is not online, and the data might be on tapes sitting offline, so it is not good for data that needs immediate access. Going from f4 to Haystack also takes longer than a cache hit, but not that long: a couple of seconds.
Disaster recovery is traditionally what cold storage is about, but that is not useful for online services. Haystack's replication gives a durability guarantee but also a performance benefit. It is striking how much engineering goes into these seemingly trivial uses.