Notes
Lecture 14
----------
BLOBs
- photos, videos, or other binary data
- immutable
traditional filesystems have to support reading & writing at any time
- here, we just need creation and reading
Haystack
- what's the problem that it solved?
- making photo storage more efficient
- higher density
- more requests per second
How do you get faster? Do less work
- reduce disk operations by reducing what has to be read
- they benchmarked, noticed metadata lookups were taking lots of time
- because you first read metadata, then read data
Idea: keep a cache of metadata in RAM, so we can just do data I/O
- to do this, need to minimize metadata
Why do regular filesystems have so much metadata?
(Does anyone know how ISO9660/CD-ROM/DVD/etc is organized?)
Traditional filesystems have to allow files to grow and shrink arbitrarily
- have to add or remove blocks from a file
- removed blocks need to be allocated to new files
optical media filesystems are designed to be read only
- so can be optimized for writes
regular filesystems are optimized by organizing files into ranges of blocks
- want to have as few as possible, but can have sequences of blocks
anywhere on disk
With a read-only filesystem, we can just lay out files one at a time, each getting as many blocks as it needs in sequence
- file a is blocks 2-50, file b is blocks 51-1000, etc.
Note this GREATLY reduces size of metadata
- because metadata includes the list of blocks
- what's the most compact way to represent a list of blocks? a range
1, 5, 100, 2, ... vs 2-10?
remember with BLOBs we're talking about large files
- megabytes up
- blocks are 4k normally
With haystack, we have a big haystack with lots of needles
- a needle is just a file (data + metadata), but stored sequentially
(metadata first, then data)
In memory, I just need the name of the file and its starting and ending blocks
- from that I get all the data at once
What happens when I delete a needle (a file)?
- fragmentation, i.e., unused storage
Facebook would mark a file as being deleted, but it wouldn't delete it
(and space would only be reclaimed when the entire haystack was compacted)
- could stick around for a long time
A key goal for F4 was to facilitate quick data deletion
- so delete => delete encryption key
Diversion: How do SSD's work?
- randomly accessible array of blocks (interface)
- looks like a hard disk mostly, except there's one problem
- writes damage individual storage cells
- generally only good for hundreds to thousands of writes,
individually, then they fail
Normal filesystems have hot spot, places where there are lots of write accesses
- for example, inodes, especially the accessed timestamp
So, for any sort of flash storage to be practical, we have to smooth out write hotspots
- can't write to the same area of flash repeatedly
Solution: every write goes to a new set of storage cells
- this is "wear leveling"
But what if I want to keep writing to the same block?
- abstraction level: writing to block 2000 repeatedly actually goes
to many different storage cells
- keeps track what is the current block 2000, returns that when it is requested
So, there is massive duplication of repeatedly written files in SSDs.
- and there is background garbage collection to get rid of old ones
Writing always to new places, basically "appending" to the disk always,
is what a "log-structured filesystem" does
- makes writes fast at the expense of slower reads unless you use lots of cache
haystack is like a log-structured filesystem, kind of (whole files are written so don't have the fragmentation issue of regular log-structured filesystems)
Why "warm" storage?
- hot is stuff that is being frequently accessed
- cold is infrequently accessed
so warm is in between
- been around, should be pretty fast, but doesn't need to be the fastest
- can optimize for durability and space efficiency
want to have the same kind of guarantees we get with lots of replicas, but with fewer replicas
- hence, erasure coding
Same idea as RAID-5, but more sophisticated
- can lose a disk but still reconstruct the data (at reduced performance)
Trade computation for storage space