Notes
Haystack & f4
-------------
- What problem does Haystack solve? What problem does f4 solve?
- How does Haystack work? How does f4 work?
- What behavior patterns are needed to make Haystack and f4 work well?
- How grounded are these patterns in human behavior?
- What is the relationship between Haystack and f4?
- What are the key technical insights used to build these systems?
- To what degree could these systems be used for other tasks?
We'll meet again at 12:18
What problems do these systems solve?
- high-performance, cheap photo/BLOB storage
How do we make it high performance & cheap?
- in part, take advantage of access patterns (long tail)
- most photos won't be accessed for a long time
- but recent ones will, and a photo that has been accessed once tends to be accessed a lot
- but people will sometimes just scroll through lots of photos too, so
that should work okay
Before Haystack, how did they store images?
- NFS appliances (NetApp?)
- CDNs
What is a content delivery network (CDN)?
- classically, a way for a web server to offload static assets/cache frequently accessed resources
- with "edge computing" this is changing: dynamic data is moving into the CDN too
- idea is: if large static assets are served from servers close to clients, responses are faster, and the primary web server & database carry less load and so scale better
A big problem for any CDN: where to put the data?
- you want the right data to be "near" the right client
- but you can't store all data near every client
- so, you have to predict where data is going to go
- big insight behind Akamai
Facebook was paying too much for standard CDNs, and they weren't really appropriate for a lot of its workload
- no good for the tail
"long tail" refers to the distribution
- not normal, more power law
- the most frequent items account for a large share of the area under the curve
- but low-frequency items collectively are still a huge chunk of the area
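A rough numeric illustration of the long tail, assuming (purely for illustration) that the photo at popularity rank r is requested in proportion to 1/r. The real Facebook distribution is not given in these notes, and zipf_shares is a made-up helper name.

```python
# Toy model (assumption): request frequency of the item at rank r is proportional to 1/r.
def zipf_shares(num_items, head_size):
    """Return (head_share, tail_share) of total requests under a 1/rank model."""
    weights = [1.0 / rank for rank in range(1, num_items + 1)]
    total = sum(weights)
    head = sum(weights[:head_size]) / total
    return head, 1.0 - head

# Compare the most popular 1% of photos against the other 99%.
head, tail = zipf_shares(num_items=1_000_000, head_size=10_000)
print(f"top 1% of photos: {head:.0%} of requests, remaining 99%: {tail:.0%}")
```

Under this assumed model the head gets most of the traffic, but the tail still gets roughly a third of all requests, which is too much to ignore and too spread out to cache well.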
Big idea of Haystack: if we simplify metadata storage (keep it small enough to hold in memory), we can optimize data access times
- really, want to get everything we need in one read from disk with no seeks
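A minimal sketch of the "one read" idea, not Haystack's actual on-disk format: many photos are appended to one large file, and a small in-memory index maps each photo id to its offset and size. ToyVolume and its method names are hypothetical, and an io.BytesIO stands in for the on-disk volume file.

```python
import io

class ToyVolume:
    """One large append-only file plus an in-memory index (hypothetical names)."""

    def __init__(self):
        self.data = io.BytesIO()   # stands in for one big on-disk volume file
        self.index = {}            # photo_id -> (offset, size), kept entirely in RAM

    def put(self, photo_id, blob):
        offset = self.data.seek(0, io.SEEK_END)   # always append at the end
        self.data.write(blob)
        self.index[photo_id] = (offset, len(blob))

    def get(self, photo_id):
        offset, size = self.index[photo_id]       # metadata lookup needs no disk I/O
        self.data.seek(offset)                    # position directly at the photo
        return self.data.read(size)               # a single read returns the photo

volume = ToyVolume()
volume.put(1, b"first photo bytes")
volume.put(2, b"second photo bytes")
print(volume.get(2))
```

The point of the design is that no per-photo filesystem metadata (directory entries, inodes) has to be fetched from disk before the photo itself can be read.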
What's the problem with deleting photos in Haystack?
- marked for deletion, still available until compaction
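A toy sketch of why deletion is delayed, assuming a volume is just a list of records with a deleted flag. The record layout and function names are invented for illustration.

```python
def delete(records, photo_id):
    """Mark a photo as deleted without touching its bytes."""
    for record in records:
        if record["id"] == photo_id:
            record["deleted"] = True   # only a flag flips; the bytes stay on disk

def compact(records):
    """Copy only live records into a fresh volume, dropping deleted ones."""
    return [record for record in records if not record["deleted"]]

volume = [
    {"id": 1, "data": b"cat photo bytes", "deleted": False},
    {"id": 2, "data": b"dog photo bytes", "deleted": False},
]

delete(volume, 2)
bytes_still_on_disk = any(r["id"] == 2 for r in volume)     # True until compaction
volume = compact(volume)
bytes_after_compaction = any(r["id"] == 2 for r in volume)  # False once compacted
```

Space is only reclaimed, and the data only truly gone, when compaction rewrites the volume, which may be long after the user asked for the delete.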
Facebook needed to fix this: they wanted deletes to take effect immediately
How does f4 fix this?
- encrypt all data
- delete key to delete data
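A sketch of delete-by-key-destruction, assuming one key per photo kept in a small separate store. The cipher here is a toy (SHA-256 used as a keystream) chosen only so the example needs nothing beyond the standard library. It is not what f4 uses and not safe for real data. All names are hypothetical.

```python
import hashlib
import secrets

def keystream_xor(key, data):
    """Toy stream cipher: XOR data with SHA-256(key || counter) blocks. NOT for real use."""
    out = bytearray()
    for block_start in range(0, len(data), 32):
        counter = (block_start // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        chunk = data[block_start:block_start + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

key_store = {}    # small and mutable: photo_id -> key
blob_store = {}   # large and immutable: photo_id -> ciphertext

def put(photo_id, blob):
    key = secrets.token_bytes(32)
    key_store[photo_id] = key
    blob_store[photo_id] = keystream_xor(key, blob)

def get(photo_id):
    key = key_store.get(photo_id)
    if key is None:
        return None                       # key gone: ciphertext is unreadable
    return keystream_xor(key, blob_store[photo_id])

def delete(photo_id):
    key_store.pop(photo_id, None)         # ciphertext stays put until later cleanup

put(7, b"holiday photo bytes")
delete(7)
print(get(7))   # the ciphertext is still stored, but nothing can decrypt it
```

The delete touches only the tiny key store, so it can be immediate even though the large encoded data is never rewritten in place.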
So why warm storage?
- different tradeoff between cost, performance, and durability
Big insight is that Haystack is expensive
- 3 copies of all data (the f4 paper puts the effective replication factor at 3.6x once RAID-6 overhead is counted)
Can we have fewer copies but still have similar reliability guarantees?
- yes, with erasure codes
Erasure codes encode data with redundancy so that lost pieces can be recovered from the surviving ones
- robust to erasure
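The simplest erasure code is a single XOR parity block, sketched below. This is a stand-in to show the principle: the f4 paper describes Reed-Solomon codes, which tolerate more than one lost block. The block contents and helper name are made up.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)          # 1 extra block protects 3 data blocks

# Lose the middle block, then rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
print(recovered)   # b'BBBB'
```

Here 4 blocks are stored for 3 blocks of data, about 1.33x overhead, versus 3x for keeping three full copies. The tradeoff is that recovery requires reading the other blocks and recomputing, which is acceptable for warm data that is rarely read.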