DistOS 2023W 2023-03-29

Latest revision as of 18:20, 29 March 2023

Notes

Haystack & f4
-------------

 - What problem does Haystack solve?  What problem does f4 solve?
 - How does Haystack work?  How does f4 work?
 - What behavior patterns are needed to make Haystack and f4 work well?
   - How grounded are these patterns in human behavior?
 - What is the relationship between Haystack and f4?
 - What are the key technical insights used to build these systems?
 - To what degree could these systems be used for other tasks?

We'll meet again at 12:18

What problems do these systems solve?
 - high-performance, cheap photo/BLOB storage

How do we make it high performance & cheap?
 - in part, by taking advantage of access patterns (long tail)
   - most photos won't be accessed for a long time
   - but recent ones will, and a photo that has been accessed once is likely to be accessed many more times
   - people will also sometimes scroll through lots of old photos, so
     that case has to work acceptably too

Before Haystack, how did they store images?
 - NFS appliances (NetApp?)
 - CDNs

What is a content delivery network (CDN)?
 - classically, a way for a web server to offload static assets and cache frequently accessed resources
   - with "edge computing" this is changing: dynamic data is put into the CDN too
 - idea: if large static assets are served from servers close to clients, responses are faster, and the primary web server & database carry less load and so scale better

A big problem for any CDN: where to put the data?
 - you want the right data to be "near" the right client
 - but you can't store all data near every client
 - so, you have to predict where each piece of data is going to be requested
    - big insight behind Akamai

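Facebook was paying too much for standard CDNs, and they weren't really appropriate for a lot of its workload
 - no good for the tail: a CDN cache only helps with popular content, so requests for rarely accessed photos miss and fall through to the backing store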

"long tail" refers to the distribution
 - not normal, more like a power law
 - the most frequently accessed items account for much of the area under the curve
 - but low-frequency items collectively are also a huge chunk of the area
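
To put numbers on the long tail, here is a small sketch. It assumes popularity follows a Zipf distribution with exponent 1 (an assumption for illustration, not a measurement from Facebook); `zipf_weights` and `head_share` are hypothetical helper names.

```python
def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the item at rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def head_share(n, head, s=1.0):
    """Fraction of all requests that go to the `head` most popular items."""
    weights = zipf_weights(n, s)
    return sum(weights[:head]) / sum(weights)


if __name__ == "__main__":
    n = 1_000_000  # assumed catalog size
    for head in (1_000, 10_000, 100_000):
        share = head_share(n, head)
        print(f"top {head:>7} of {n}: {share:.1%} of requests, "
              f"tail gets {1 - share:.1%}")
```

Under these assumed parameters the top 0.1% of items draw roughly half the requests, so a cache holding only the head still misses on about half of all requests. That is the sense in which a CDN is "no good for the tail".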


Big idea of Haystack: if we simplify metadata storage (make it small enough to keep in memory), we can optimize data access times
 - really, want to get everything we need in one read from disk, with no extra disk operations just to look up metadata
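
A minimal sketch of that idea, assuming one large append-only file per volume and a Python dict as the in-memory index. `TinyHaystack` and its methods are hypothetical names, not Haystack's actual interface.

```python
import os
import tempfile


class TinyHaystack:
    """Append-only blob file plus an in-memory index (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # photo_id -> (offset, size), held entirely in RAM
        open(path, "ab").close()  # create the volume file if it is missing

    def put(self, photo_id, data):
        offset = os.path.getsize(self.path)  # new data always lands at the end
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]  # memory lookup, no disk I/O
        # A real store would keep the file open; it is reopened here for brevity.
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)  # the one read that fetches the photo


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = TinyHaystack(os.path.join(d, "volume.dat"))
        store.put(1, b"first photo bytes")
        store.put(2, b"second photo")
        print(store.get(2))
```

The design choice to notice: there is no per-photo file, so there is no per-photo filesystem metadata to fetch from disk before the data itself.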

What's the problem with deleting photos in Haystack?
 - a deleted photo is only marked with a delete flag; its bytes stay on disk until compaction
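
A sketch of flag-then-compact deletion, using an in-memory list to stand in for a volume file. `Needle`, `delete`, `read`, `compact`, and `bytes_on_disk` are hypothetical names for illustration; it assumes reads skip flagged entries.

```python
from dataclasses import dataclass


@dataclass
class Needle:
    photo_id: int
    data: bytes
    deleted: bool = False  # flipped on delete; the bytes stay where they are


def delete(volume, photo_id):
    """Mark matching needles as deleted without reclaiming any space."""
    for needle in volume:
        if needle.photo_id == photo_id:
            needle.deleted = True


def read(volume, photo_id):
    """Return the photo bytes, or None if it is missing or flagged deleted."""
    for needle in volume:
        if needle.photo_id == photo_id and not needle.deleted:
            return needle.data
    return None


def compact(volume):
    """Copy live needles into a new volume; only now is space reclaimed."""
    return [needle for needle in volume if not needle.deleted]


def bytes_on_disk(volume):
    return sum(len(needle.data) for needle in volume)


if __name__ == "__main__":
    volume = [Needle(1, b"cat photo"), Needle(2, b"dog photo!")]
    delete(volume, 1)
    print(read(volume, 1), bytes_on_disk(volume))  # bytes are still there
    volume = compact(volume)
    print(bytes_on_disk(volume))
```

The gap between `delete` and `compact` is the problem: the data physically exists for as long as compaction is deferred.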

Facebook needed to fix this: they wanted deletes to take effect immediately

How does f4 fix this?
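 - encrypt all data, with each BLOB's key stored separately from the data
 - to delete a BLOB, delete its key: the ciphertext left on disk can no longer be read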
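
A sketch of delete-by-deleting-the-key. Loud assumption: the cipher here is a toy keystream built from SHA-256, used only so the example runs with the standard library; it is NOT secure and is not what f4 uses. `KeyedBlobStore` and the helper names are hypothetical.

```python
import hashlib
import secrets


def keystream(key, length):
    """Toy keystream: SHA-256 over key + counter. Illustration only, not secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key, data):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


class KeyedBlobStore:
    def __init__(self):
        self.keys = {}   # small, mutable key table
        self.blobs = {}  # stands in for large, immutable warm storage

    def put(self, blob_id, data):
        key = secrets.token_bytes(32)  # fresh random key per BLOB
        self.keys[blob_id] = key
        self.blobs[blob_id] = xor_cipher(key, data)

    def get(self, blob_id):
        key = self.keys.get(blob_id)
        if key is None:
            return None  # key is gone, so the ciphertext is unreadable
        return xor_cipher(key, self.blobs[blob_id])

    def delete(self, blob_id):
        self.keys.pop(blob_id, None)  # immediate; the blob bytes are untouched


if __name__ == "__main__":
    store = KeyedBlobStore()
    store.put("photo-1", b"holiday picture")
    print(store.get("photo-1"))
    store.delete("photo-1")
    print(store.get("photo-1"), "photo-1" in store.blobs)
```

The delete touches only the small key table, so it is immediate even though the large encrypted data is never rewritten.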

So why warm storage?
 - different tradeoff between cost, performance, and durability

Big insight is that Haystack is expensive
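 - 3 copies of all data (the f4 paper puts the effective replication factor at 3.6 once RAID-6 overhead within each copy is counted)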

Can we have fewer copies but still have similar reliability guarantees?
 - yes, with erasure codes

Erasure codes add redundant (parity) data so that lost pieces can be rebuilt from the surviving ones
 - robust to erasure, i.e. to blocks that are known to be missing
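
The simplest erasure code is a single XOR parity block, sketched below. This is an illustration of the principle only: f4 is described as using Reed-Solomon codes, which tolerate more than one lost block. `xor_blocks`, `encode`, and `recover` are hypothetical names, and blocks are assumed to be of equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    stripe = encode(data)          # 4 data blocks + 1 parity block
    rebuilt = recover(stripe, 2)   # pretend block 2 was lost
    print(rebuilt == data[2])
```

With k data blocks and one parity block the storage cost is (k+1)/k times the data size, e.g. 1.25x for k = 4, and any single lost block can be rebuilt. Plain replication needs 2x just to survive one loss.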