DistOS 2023W 2023-03-29
Notes
Haystack & f4
-------------
 - What problem does Haystack solve? What problem does f4 solve?
 - How does Haystack work? How does f4 work?
    - what behavior patterns are needed to make Haystack and f4 work well?
    - how grounded are these patterns in human behavior?
 - What is the relationship between Haystack and f4?
 - What are the key technical insights used to build these systems?
 - To what degree could these systems be used for other tasks?

We'll meet again at 12:18

What problems do these systems solve?
 - high-performance, cheap photo/BLOB storage

How do we make it high performance & cheap?
 - in part, take advantage of access patterns (long tail)
 - most photos won't be accessed for a long time
 - but recent ones will, and once a photo has been accessed once, it'll be accessed a lot
 - but people will sometimes just scroll through lots of photos too, so that should work okay

Before Haystack, how did they store images?
 - NFS appliances (NetApp?)
 - CDNs

What is a content delivery network (CDN)?
 - classically, a way for a web server to offload static assets/cache frequently accessed resources
 - with "edge computing" this is changing: dynamic data goes into the CDN too
 - idea is, if large static assets are served from servers close to clients, it will be faster, and the primary web server & database will have less load and so can scale better

A big problem for any CDN is where to put data?
 - you want the right data to be "near" the right client
 - but you can't store all data near every client
 - so, you have to predict where data is going to go
 - big insight behind Akamai

Facebook was paying too much money using standard CDNs, and CDNs weren't really appropriate for a lot of their workload
 - no good for the tail

"long tail" refers to the distribution
 - not normal, more power law
 - most frequent items have most of the area under the curve
 - but low-frequency items collectively are a huge chunk of the area

Big idea of Haystack is if we can simplify metadata storage we can optimize data access times
 - really, want to get everything we need in one read from disk with no seeks

What's the problem with deleting photos in Haystack?
 - marked for deletion, still available until compaction

Facebook needed to fix this, wanted deletes to happen immediately

How does f4 fix this?
 - encrypt all data
 - delete key to delete data

So why warm storage?
 - different tradeoff between cost, performance, and durability

Big insight is that Haystack is expensive
 - 3 copies of all data

Can we have fewer copies but still have similar reliability guarantees?
 - yes, with erasure codes

Erasure codes encode data in a way that you can recover lost data
 - robust to erasure
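
To make "robust to erasure" concrete, here is a minimal sketch of the simplest possible erasure code: a single XOR parity block over a stripe of data blocks. This is an illustration only, not f4's actual scheme; production systems generally use stronger codes such as Reed-Solomon that survive more than one lost block. The function names (xor_blocks, encode, recover) and the sample data are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe):
    """Rebuild the single erased block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity recovers exactly one erased block")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks equals the erased block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical sample stripe: three equal-sized data blocks.
data = [b"hay", b"sta", b"ck!"]
stripe = encode(data)

# Simulate losing one block (e.g. a failed disk), then rebuild it.
damaged = list(stripe)
damaged[1] = None
restored = recover(damaged)
print(restored[:3] == data)
```

Here 3 data blocks plus 1 parity block cost about 1.33x the raw size, versus 3x for three full copies, while still tolerating the loss of any one block. The tradeoff is that rebuilding a lost block means reading all the surviving blocks, which fits warm data better than hot data.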