DistOS 2023W 2023-03-29
Latest revision as of 18:20, 29 March 2023
Notes
Haystack & f4
-------------
 - What problem does Haystack solve?  What problem does f4 solve?
 - How does Haystack work?  How does f4 work?
 - What behavior patterns are needed to make Haystack and f4 work well?
   - how grounded are these patterns in human behavior?
 - What is the relationship between Haystack and f4?
 - What are the key technical insights used to build these systems?
 - To what degree could these systems be used for other tasks?
We'll meet again at 12:18
What problems do these systems solve?
 - high-performance, cheap photo/BLOB storage
How do we make it high performance & cheap?
 - in part, take advantage of access patterns (long tail)
   - most photos won't be accessed for a long time
   - but recent ones will, and a photo that has been accessed once tends to be accessed a lot
   - but people will sometimes just scroll through lots of photos too, so
     that should work okay
Before Haystack, how did they store images?
 - NFS appliances (NetApp?)
 - CDNs
What is a content delivery network (CDN)?
 - classically, a way for a web server to offload static assets/cache frequently accessed resources
   - with "edge computing" this is changing, puts dynamic data into the CDN
 - idea is, if large static assets are served from servers close to clients, responses are faster and the primary web server & database carry less load, so they scale better
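A minimal sketch of the classic CDN idea, assuming an edge node that keeps a small cache in front of the origin server. The names (EdgeCache, origin) and the LRU eviction policy are assumptions for illustration; the notes do not say which policy any real CDN uses.

```python
from collections import OrderedDict

class EdgeCache:
    """Hypothetical edge node: small LRU cache in front of an origin server."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self.items = OrderedDict()  # url -> body, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.items:
            self.items.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return self.items[url]
        self.misses += 1
        body = self.fetch_from_origin(url)  # slow path: go back to the origin
        self.items[url] = body
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return body

origin_requests = []

def origin(url):
    """Stand-in for the primary web server; records every request it sees."""
    origin_requests.append(url)
    return f"<bytes of {url}>"

edge = EdgeCache(capacity=2, fetch_from_origin=origin)
for url in ["/logo.png", "/logo.png", "/a.jpg", "/logo.png", "/b.jpg", "/a.jpg"]:
    edge.get(url)

print(edge.hits, edge.misses, origin_requests)
```

The popular asset is served from the edge after its first request, while rarely requested items keep falling through to the origin.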
A big problem for any CDN: where to put the data?
 - you want the right data to be "near" the right client
 - but you can't store all data near every client
 - so, you have to predict where data is going to go
    - big insight behind Akamai
Facebook was paying too much for standard CDNs, and CDNs weren't really a good fit for much of their workload
 - no good for the tail
"long tail" refers to the distribution
 - not a normal distribution, closer to a power law
 - the most frequently accessed items account for a large share of the area under the curve
 - but low-frequency items collectively are also a huge chunk of the area
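A rough numeric sketch of the long tail, assuming photo popularity follows a Zipf distribution with exponent 1 over one million photos. Both the exponent and the photo count are made-up illustration values, not measurements from the papers.

```python
def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the item of rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]

def share_of_requests(weights, first, last):
    """Fraction of all requests going to ranks first..last (1-indexed, inclusive)."""
    return sum(weights[first - 1:last]) / sum(weights)

weights = zipf_weights(1_000_000)
head = share_of_requests(weights, 1, 1_000)          # the 1,000 most popular photos
tail = share_of_requests(weights, 1_001, 1_000_000)  # everything else

print(f"head (top 0.1% of photos): {head:.2f} of requests")
print(f"tail (other 99.9%):        {tail:.2f} of requests")
```

Under these assumptions a tiny head and an enormous tail each take roughly half of the requests, which is why a cache sized for the head still leaves a lot of traffic for the backing store.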
Big idea of Haystack: if we simplify metadata storage, we can optimize data access times
 - really, want to get everything we need in one read from disk, with no extra seeks for metadata
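A minimal sketch of the "one read" idea, assuming photos are appended to one large volume file and a small in-memory index maps each photo id to its offset and size. The class and field names are hypothetical, and an in-memory buffer stands in for the on-disk file.

```python
import io

class Volume:
    """Hypothetical append-only volume: one big file plus an in-memory index."""

    def __init__(self, backing=None):
        # io.BytesIO stands in for a large on-disk file in this sketch.
        self.data = backing if backing is not None else io.BytesIO()
        self.index = {}  # photo_id -> (offset, size), kept entirely in RAM

    def put(self, photo_id, blob):
        offset = self.data.seek(0, io.SEEK_END)  # always append at the end
        self.data.write(blob)
        self.index[photo_id] = (offset, len(blob))

    def get(self, photo_id):
        offset, size = self.index[photo_id]  # metadata lookup touches no disk
        self.data.seek(offset)
        return self.data.read(size)  # the only read against the volume

volume = Volume()
volume.put(1, b"aaa")
volume.put(2, b"bbbbb")
print(volume.index[2], volume.get(2))
```

Because the metadata lookup happens in memory, the only I/O against the volume is the read of the photo bytes themselves.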
What's the problem with deleting photos in Haystack?
 - deleted photos are only marked as deleted; the data stays on disk until compaction
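A small sketch of why this is a problem, assuming an append-only log where delete only sets a flag and a separate compaction pass reclaims space. All names are hypothetical, and the sketch models only the fact that the bytes remain stored, not how reads of flagged photos are handled.

```python
class FlaggedStore:
    """Hypothetical append-only store with delete flags and compaction."""

    def __init__(self):
        self.log = []    # append-only records: [photo_id, blob, deleted_flag]
        self.index = {}  # photo_id -> position of the live record in the log

    def put(self, photo_id, blob):
        self.index[photo_id] = len(self.log)
        self.log.append([photo_id, blob, False])

    def delete(self, photo_id):
        position = self.index.pop(photo_id)
        self.log[position][2] = True  # only a flag is set; the bytes stay put

    def bytes_still_stored(self):
        return sum(len(blob) for _, blob, _ in self.log)

    def compact(self):
        # Copy live records into a fresh log; flagged data disappears only now.
        live = [record for record in self.log if not record[2]]
        self.log = live
        self.index = {record[0]: i for i, record in enumerate(live)}

store = FlaggedStore()
store.put(1, b"abc")
store.put(2, b"defg")
store.delete(1)
before = store.bytes_still_stored()  # deleted photo's bytes are still there
store.compact()
after = store.bytes_still_stored()
print(before, after)
```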
Facebook needed to fix this: they wanted deletes to take effect immediately
How does f4 fix this?
 - encrypt all data
 - delete a blob's key to delete its data (without the key, the stored bytes are unreadable)
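A sketch of delete-by-key-deletion, assuming each blob gets its own random key held in a small, separate key store. The cipher below is a toy SHA-256 keystream XOR used only so the example runs with the standard library; it is NOT vetted cryptography, and a real system would use a proper authenticated cipher. All names are hypothetical.

```python
import hashlib
import secrets

def keystream(key, length):
    """Toy keystream (SHA-256 over key + counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_with_key(data, key):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

class EncryptedBlobStore:
    """Hypothetical store: bulk ciphertext plus a small separate key table."""

    def __init__(self):
        self.blobs = {}  # blob_id -> ciphertext (large, slow to rewrite)
        self.keys = {}   # blob_id -> key (small, cheap to update)

    def put(self, blob_id, plaintext):
        key = secrets.token_bytes(32)
        self.keys[blob_id] = key
        self.blobs[blob_id] = xor_with_key(plaintext, key)

    def get(self, blob_id):
        key = self.keys.get(blob_id)
        if key is None:
            return None  # key is gone, so the ciphertext cannot be decoded
        return xor_with_key(self.blobs[blob_id], key)

    def delete(self, blob_id):
        self.keys.pop(blob_id, None)  # immediate; ciphertext is cleaned up later

store = EncryptedBlobStore()
store.put("photo-1", b"hello photo")
readable = store.get("photo-1")
store.delete("photo-1")
print(readable, store.get("photo-1"), "photo-1" in store.blobs)
```

The delete touches only the small key table, so it takes effect right away even though the encrypted bytes remain in bulk storage.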
So why warm storage?
 - different tradeoff between cost, performance, and durability
Big insight is that Haystack is expensive
 - 3 copies of all data
Can we have fewer copies but still have similar reliability guarantees?
 - yes, with erasure codes
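A quick back-of-the-envelope comparison, assuming a code that splits data into 10 data blocks and adds 4 parity blocks. The (10, 4) parameters are picked here for illustration; the notes above do not give specific numbers.

```python
def storage_overhead(data_blocks, parity_blocks):
    """Bytes stored per byte of user data for a data+parity erasure code."""
    return (data_blocks + parity_blocks) / data_blocks

replication_overhead = 3.0                 # 3 full copies of everything
erasure_overhead = storage_overhead(10, 4)  # 14 blocks stored per 10 of data

print(f"3 copies:      {replication_overhead:.1f}x storage, survives losing 2 copies")
print(f"(10, 4) code:  {erasure_overhead:.1f}x storage, survives losing any 4 of 14 blocks")
```

The "any 4 blocks" claim holds for codes such as Reed-Solomon, where any 10 of the 14 blocks are enough to rebuild the data.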
Erasure codes encode data with extra redundancy so that lost pieces can be rebuilt from the surviving ones
 - robust to erasure (losing whole blocks, as opposed to corrupted bits)
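The simplest erasure code is a single XOR parity block, sketched below. It tolerates the loss of any one block; this is a stand-in to show the idea, not a claim about which code f4 uses for any particular purpose, and the block contents are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"hay1", b"hay2", b"hay3"]
parity = xor_blocks(data)  # one extra block instead of extra full copies

# Suppose the second data block is lost (erased).
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)  # XOR of everything that is left

print(recovered == data[1])
```

XOR-ing the survivors cancels out every block that is still present, leaving exactly the missing one.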