DistOS 2023W 2023-03-29
Latest revision as of 18:20, 29 March 2023
Notes
Haystack & f4
-------------
- What problem does Haystack solve? What problem does f4 solve?
- How does Haystack work? How does f4 work?
- What behavior patterns are needed to make Haystack and f4 work well?
- How grounded are these patterns in human behavior?
- What is the relationship between Haystack and f4?
- What are the key technical insights used to build these systems?
- To what degree could these systems be used for other tasks?
We'll meet again at 12:18
What problems do these systems solve?
- high-performance, cheap photo/BLOB storage
How do we make it high-performance & cheap?
- in part, by taking advantage of access patterns (long tail)
- most photos won't be accessed for a long time
- but recent ones will be, and once a photo has been accessed it tends to be accessed a lot
- but people will sometimes just scroll through lots of photos too, so that should work okay
Before Haystack, how did they store images?
- NFS appliances (NetApp?)
- CDNs
What is a content delivery network (CDN)?
- classically, a way for a web server to offload static assets and cache frequently accessed resources
- with "edge computing" this is changing: dynamic data is now pushed into the CDN too
- the idea: if large static assets are served from servers close to clients, responses are faster, and the primary web server & database carry less load and so can scale better
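The caching side of a CDN can be sketched as a small least-recently-used (LRU) cache sitting in front of an origin server. This is only an illustration: the `EdgeCache` class, the `origin` function, and the URLs are made up here, and real CDNs use far more elaborate placement and eviction policies.

```python
from collections import OrderedDict

class EdgeCache:
    """Toy edge cache: keeps the most recently used responses (hypothetical)."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self.items = OrderedDict()  # url -> body, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.items:
            self.items.move_to_end(url)     # mark as most recently used
            self.hits += 1
            return self.items[url]
        self.misses += 1
        body = self.fetch_from_origin(url)  # slow path: go back to the origin
        self.items[url] = body
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return body

origin_requests = []

def origin(url):
    """Stand-in for the primary web server; records every request it sees."""
    origin_requests.append(url)
    return f"<bytes of {url}>"

cache = EdgeCache(capacity=2, fetch_from_origin=origin)
for url in ["/a.jpg", "/a.jpg", "/b.jpg", "/a.jpg", "/c.jpg", "/b.jpg"]:
    cache.get(url)

print(cache.hits, cache.misses, origin_requests)
```

Popular items stay cached and never reach the origin again, while rarely requested items keep getting evicted and refetched.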
A big problem for any CDN: where to put data?
- you want the right data to be "near" the right client
- but you can't store all data near every client
- so, you have to predict where data is going to go
- big insight behind Akamai
Facebook was paying too much money for standard CDNs, and they weren't really appropriate for much of its workload
- no good for the tail
"long tail" refers to the distribution
- not normal, closer to a power law
- the most frequently accessed items account for most of the area under the curve
- but low-frequency items collectively are a huge chunk of the area
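To get a feel for the shape, here is a rough sketch that assumes item popularity follows a Zipf distribution with exponent 1 over one million items. Both numbers are illustrative assumptions, not measurements from the papers.

```python
def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the item of rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]

def head_share(weights, head_fraction):
    """Fraction of all accesses that go to the top `head_fraction` of items.

    Assumes `weights` is sorted from most to least popular.
    """
    head_count = max(1, int(len(weights) * head_fraction))
    return sum(weights[:head_count]) / sum(weights)

weights = zipf_weights(1_000_000)   # assumed catalogue size
head = head_share(weights, 0.01)    # the most popular 1% of items

print(f"top 1% of items: {head:.0%} of accesses")
print(f"remaining 99%:   {1 - head:.0%} of accesses")
```

Under these assumptions the head gets the majority of accesses, but the tail is still far too large a share to ignore, which is why a cache that only serves the head well is not enough.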
Big idea of Haystack: if we can simplify metadata storage, we can optimize data access times
- really, want to get everything we need in one read from disk, with no extra seeks for metadata lookups
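A minimal sketch of that idea: many photos appended to one large file, with a small in-memory index from photo id to (offset, size). The `Volume` class and its method names are invented for illustration and are not Haystack's actual on-disk format or API.

```python
import os
import tempfile

class Volume:
    """Toy append-only volume: one big file plus an in-memory index."""

    def __init__(self, path):
        self.file = open(path, "w+b")
        self.index = {}  # photo_id -> (offset, size), kept entirely in RAM

    def put(self, photo_id, data):
        self.file.seek(0, os.SEEK_END)         # always append at the end
        offset = self.file.tell()
        self.file.write(data)
        self.file.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]    # memory lookup, no disk I/O
        self.file.seek(offset)
        return self.file.read(size)            # the one read for the photo

    def close(self):
        self.file.close()

with tempfile.TemporaryDirectory() as tmp:
    vol = Volume(os.path.join(tmp, "volume.dat"))
    vol.put(1, b"cat photo bytes")
    vol.put(2, b"dog photo bytes")
    first = vol.get(1)
    second = vol.get(2)
    vol.close()

print(first, second)
```

Because the index is small enough to live in memory, finding a photo costs no disk access; only fetching the photo bytes does.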
What's the problem with deleting photos in Haystack?
- photos are only marked as deleted; the data stays on disk until compaction
Facebook needed to fix this; they wanted deletes to take effect immediately
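The delete problem can be sketched with a toy append-only store, where a `bytearray` stands in for the on-disk volume. The `Store` class is hypothetical and only illustrates the mark-then-compact pattern.

```python
class Store:
    """Toy append-only store with delete flags and compaction (hypothetical)."""

    def __init__(self):
        self.log = bytearray()  # stands in for the on-disk volume file
        self.index = {}         # photo_id -> (offset, size)
        self.deleted = set()    # ids flagged as deleted

    def put(self, photo_id, data):
        self.index[photo_id] = (len(self.log), len(data))
        self.log += data

    def delete(self, photo_id):
        self.deleted.add(photo_id)  # only a flag; the bytes stay in the log

    def get(self, photo_id):
        if photo_id in self.deleted or photo_id not in self.index:
            return None
        offset, size = self.index[photo_id]
        return bytes(self.log[offset:offset + size])

    def compact(self):
        """Copy live records into a fresh log, dropping deleted ones."""
        new_log, new_index = bytearray(), {}
        for photo_id, (offset, size) in self.index.items():
            if photo_id in self.deleted:
                continue
            new_index[photo_id] = (len(new_log), size)
            new_log += self.log[offset:offset + size]
        self.log, self.index, self.deleted = new_log, new_index, set()

store = Store()
store.put(1, b"secret")
store.put(2, b"public")
store.delete(1)
still_on_disk = b"secret" in store.log        # flagged, but bytes remain
store.compact()
gone_after_compaction = b"secret" not in store.log

print(still_on_disk, gone_after_compaction)
```

Reads stop returning the photo right away, but the bytes are only physically removed when compaction runs, which may be much later.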
How does f4 fix this?
- encrypt all data, with a separate key per BLOB
- delete the key to delete the data
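A rough sketch of delete-by-key-deletion. The cipher below is a toy (XOR with a SHA-256-derived keystream) used only to keep the example standard-library-only; it is not what f4 uses and is not secure. The `key_db` and `blob_store` dictionaries and the function names are hypothetical stand-ins for a small key database and bulk storage.

```python
import hashlib
import secrets

def keystream_xor(key, data):
    """Toy stream cipher: XOR data with SHA-256(key || counter) blocks.

    Illustration only; a real system would use a vetted cipher.
    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for block_start in range(0, len(data), 32):
        counter = (block_start // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        chunk = data[block_start:block_start + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

key_db = {}      # hypothetical small key store: blob_id -> key
blob_store = {}  # hypothetical bulk storage: blob_id -> ciphertext

def put(blob_id, data):
    key = secrets.token_bytes(32)      # fresh random key per BLOB
    key_db[blob_id] = key
    blob_store[blob_id] = keystream_xor(key, data)

def get(blob_id):
    key = key_db.get(blob_id)
    if key is None:
        return None                    # no key means no way to read the data
    return keystream_xor(key, blob_store[blob_id])

def delete(blob_id):
    key_db.pop(blob_id, None)          # the ciphertext is left where it is

put("photo-1", b"embarrassing party photo")
before = get("photo-1")
delete("photo-1")
after = get("photo-1")

print(before, after, "photo-1" in blob_store)
```

The delete touches only the small key store, so it takes effect immediately; the unreadable ciphertext can be cleaned up from bulk storage whenever convenient.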
So why warm storage?
- different tradeoff between cost, performance, and durability
Big insight is that Haystack is expensive
- 3 copies of all data
Can we have fewer copies but still have similar reliability guarantees?
- yes, with erasure codes
Erasure codes encode data with extra redundancy, so that lost pieces can be recovered from the pieces that survive
- robust to erasure (e.g., losing a disk or a host)
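The simplest erasure code is a single XOR parity block, sketched below. Note this is a swapped-in, simpler technique for illustration: the f4 paper describes Reed-Solomon coding, which tolerates several lost blocks, whereas one XOR parity block tolerates only one. The block contents and function names are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at `lost_index` from all the other blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = encode(data)                 # 4 data blocks + 1 parity block
rebuilt = recover(stripe, 2)          # pretend the third block was lost
overhead = len(stripe) / len(data)    # storage used per byte of data

print(rebuilt, overhead)
```

Here any one lost block can be rebuilt while storing 1.25 bytes per byte of data, compared with 3 bytes per byte for three full copies.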