DistOS 2023W 2023-03-29: Difference between revisions
Latest revision as of 18:20, 29 March 2023
Notes
Haystack & f4
-------------
- What problem does Haystack solve? What problem does f4 solve?
- How does Haystack work? How does f4 work?
- what behavior patterns are needed to make Haystack and f4 work well?
- how grounded are these patterns in human behavior?
- What is the relationship between Haystack and f4?
- What are the key technical insights used to build these systems?
- To what degree could these systems be used for other tasks?

We'll meet again at 12:18

What problems do these systems solve?
 - high-performance, cheap photo/BLOB storage

How do we make it high performance & cheap?
 - in part, take advantage of access patterns (long tail)
 - most photos won't be accessed for a long time
 - but recent ones will, and once a photo has been accessed it tends to be accessed a lot
 - but people will sometimes just scroll through lots of photos too, so
   that should work okay

Before Haystack, how did they store images?
 - NFS appliances (NetApp?)
 - CDNs

What is a content delivery network (CDN)?
 - classically, a way for a web server to offload static assets/cache frequently accessed resources
 - with "edge computing" this is changing: dynamic data is put into the CDN as well
 - idea is, if large static assets are served from servers close to clients, responses will be faster, and the primary web server & database will carry less load and so can scale better

A big problem for any CDN: where to put data?
 - you want the right data to be "near" the right client
 - but you can't store all data near every client
 - so, you have to predict where data is going to go
 - big insight behind Akamai

Facebook was paying too much for standard CDNs, and they weren't really appropriate for a lot of its workload
 - no good for the tail

"long tail" refers to the distribution
 - not normal, more power law
 - the most frequent items have most of the area under the curve
 - but, low-frequency items collectively are a huge chunk of the area

Big idea of Haystack: if we simplify metadata storage, we can optimize data access times
 - really, want to get everything we need in one read from disk, with no extra seeks for metadata

What's the problem with deleting photos in Haystack?
 - photos are only marked for deletion; the data is still there until compaction

Facebook needed to fix this, wanted deletes to happen immediately

How does f4 fix this?
 - encrypt all data
 - delete key to delete data

So why warm storage?
 - different tradeoff between cost, performance, and durability

Big insight is that Haystack is expensive
 - 3 copies of all data

Can we have fewer copies but still have similar reliability guarantees?
 - yes, with erasure codes

Erasure codes encode data in a way that you can recover lost data
 - robust to erasure
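
To put rough numbers on the "long tail" point in the notes, here is a small sketch that assumes accesses follow a Zipf-style power law (weight 1/rank). The exponent, the catalogue size, and the function name head_share are assumptions for illustration, not figures from the Haystack or f4 papers.

```python
def head_share(n_items, head_fraction, exponent=1.0):
    """Fraction of all accesses that go to the most popular items.

    Assumes item popularity is proportional to 1 / rank**exponent
    (a Zipf-style power law, chosen here only for illustration).
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    head_count = int(n_items * head_fraction)
    return sum(weights[:head_count]) / sum(weights)


if __name__ == "__main__":
    head = head_share(100_000, 0.01)
    print(f"top 1% of items: {head:.0%} of accesses")
    print(f"remaining 99%:   {1 - head:.0%} of accesses")
```

Under these assumptions the head takes most of the traffic, but the tail is still far too large a share to ignore, which is why a cache that only serves the head well is "no good for the tail".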
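
The Haystack points in the notes (simplified metadata, one disk read per photo, deletes that only mark data until compaction) can be sketched as an append-only volume with an in-memory index. This is a minimal toy under stated assumptions: the class name Volume, its methods, and the use of io.BytesIO in place of a real disk file are all hypothetical and not the actual Haystack format.

```python
import io


class Volume:
    """Toy append-only store: all metadata lives in an in-memory dict."""

    def __init__(self):
        self.log = io.BytesIO()   # stands in for one large file on disk
        self.index = {}           # photo_id -> [offset, size, deleted_flag]

    def put(self, photo_id, data):
        offset = self.log.seek(0, io.SEEK_END)  # always append at the end
        self.log.write(data)
        self.index[photo_id] = [offset, len(data), False]

    def get(self, photo_id):
        entry = self.index.get(photo_id)
        if entry is None or entry[2]:
            return None
        offset, size, _ = entry
        # Metadata came from memory, so the only I/O is this single read.
        self.log.seek(offset)
        return self.log.read(size)

    def delete(self, photo_id):
        # Only flips a flag; the bytes stay in the log until compaction.
        if photo_id in self.index:
            self.index[photo_id][2] = True

    def compact(self):
        # Copy live photos into a fresh log, dropping deleted ones.
        live = [(pid, self.get(pid))
                for pid, entry in self.index.items() if not entry[2]]
        self.log, self.index = io.BytesIO(), {}
        for pid, data in live:
            self.put(pid, data)
```

After delete(), the photo is no longer served, but its bytes are still in the log until compact() runs, which is the delete problem the notes describe.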
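
The "encrypt all data, delete key to delete data" idea from the notes can be sketched as below. Everything here is an assumption for illustration: the BlobStore class is hypothetical, and the SHA-256 counter keystream is a toy stand-in for a real cipher (it is NOT secure and is used only to keep the example dependency-free).

```python
import hashlib
import secrets


def _keystream(key, length):
    """Toy keystream from SHA-256 in counter mode (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor_with_key(data, key):
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class BlobStore:
    def __init__(self):
        self.volume = {}  # bulk storage: blob_id -> ciphertext (never rewritten)
        self.keys = {}    # small, mutable key table: blob_id -> key

    def put(self, blob_id, data):
        key = secrets.token_bytes(32)  # fresh random key per blob
        self.keys[blob_id] = key
        self.volume[blob_id] = _xor_with_key(data, key)

    def get(self, blob_id):
        key = self.keys.get(blob_id)
        if key is None:
            return None  # key gone: ciphertext is unreadable
        return _xor_with_key(self.volume[blob_id], key)

    def delete(self, blob_id):
        # Deleting the small key takes effect immediately;
        # the large ciphertext can stay where it is.
        self.keys.pop(blob_id, None)
```

The design point is that a delete touches only a tiny key record, so it can be immediate even when the bulk data sits in storage that is expensive to rewrite.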
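
As a concrete instance of "robust to erasure", here is a minimal sketch using single XOR parity, the simplest erasure code: any one lost block in a stripe can be rebuilt from the survivors. The f4 paper describes Reed-Solomon coding, which tolerates more losses; XOR is swapped in here only because it fits in a few lines. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the stripe: the data blocks plus one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def storage_overhead(n_data, n_parity):
    """Bytes stored per byte of user data."""
    return (n_data + n_parity) / n_data


if __name__ == "__main__":
    stripe = encode([b"hay!", b"stak", b"f4f4"])
    print(recover(stripe, 1))          # rebuilds the second data block
    print(storage_overhead(1, 2))      # 3 full copies: 3.0x
    print(storage_overhead(10, 4))     # 10 data + 4 parity blocks: 1.4x
```

The overhead comparison is the cost argument from the notes: three full copies store 3 bytes per byte of data, while a code with 10 data and 4 parity blocks stores 1.4 and still survives multiple lost blocks.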