DistOS 2021F 2021-11-04
Notes
Lecture 14
----------

BLOBs
- photos, videos, or other binary data
- immutable
- traditional filesystems have to support reading & writing at any time
- here, we just need creation and reading

Haystack
- what's the problem that it solved?
  - making photo storage more efficient
  - higher density
  - more requests per second

How do you get faster? Do less work
- reduce disk operations by reducing what has to be read
- they benchmarked, noticed metadata lookups were taking lots of time
  - because you first read metadata, then read data

Idea: keep a cache of metadata in RAM, so we can just do data I/O
- to do this, need to minimize metadata

Why do regular filesystems have so much metadata?
(Does anyone know how ISO9660/CD-ROM/DVD/etc. is organized?)

Traditional filesystems have to allow files to grow and shrink arbitrarily
- have to add or remove blocks from a file
- removed blocks need to be allocated to new files

Optical media filesystems are designed to be read only
- so the layout can be optimized for reads

Regular filesystems are optimized by organizing files into ranges of blocks
- want to have as few ranges as possible, but sequences of blocks can be anywhere on disk

With a read-only filesystem, we can just lay out files one at a time, each getting as many blocks as it needs in sequence
- file a is blocks 2-50, file b is blocks 51-1000, etc.
- note this GREATLY reduces the size of metadata
  - because metadata includes the list of blocks
  - what's the most compact way to represent a list of blocks? a range
    - compare 1, 5, 100, 2, ... vs. 2-10

Remember, with BLOBs we're talking about large files
- megabytes and up
- blocks are 4K normally

With Haystack, we have a big haystack with lots of needles
- a needle is just a file (data + metadata), but stored sequentially (metadata first, then data)
- in memory, I just need the name of the file and its starting and ending blocks
  - from that I can get all the data at once

What happens when I delete a needle (a file)?
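The in-memory metadata idea can be sketched in a few lines. This is a toy illustration (the class and method names are invented, not Haystack's real interface): needles are appended sequentially to one big store file, and a RAM-resident index maps each needle's name to its (offset, size), so serving a read needs only a single data I/O.

```python
import os
import tempfile

class HaystackStore:
    """Toy sketch of a Haystack-style store: one big file, index in RAM."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # name -> (offset, size), kept entirely in RAM

    def write_needle(self, name, data):
        with open(self.path, "ab") as f:
            offset = f.tell()    # needles are laid out one after another
            f.write(data)
        self.index[name] = (offset, len(data))

    def read_needle(self, name):
        offset, size = self.index[name]   # metadata lookup: no disk access
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)           # one sequential read gets the whole file

store = HaystackStore(os.path.join(tempfile.mkdtemp(), "haystack.dat"))
store.write_needle("photo1.jpg", b"jpeg bytes here")
store.write_needle("photo2.jpg", b"more jpeg bytes")
print(store.read_needle("photo1.jpg"))
```

Because each file occupies one contiguous range, the per-file metadata is just a name and two numbers, small enough to keep in memory for millions of photos.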
- fragmentation, i.e., unused storage

Facebook would mark a file as being deleted, but it wouldn't delete it (space was only reclaimed when the entire haystack was compacted)
- deleted data could stick around for a long time

A key goal for F4 was to facilitate quick data deletion
- so delete => delete the encryption key

Diversion: How do SSDs work?
- randomly accessible array of blocks (interface)
- looks like a hard disk mostly, except there's one problem
  - writes damage individual storage cells
  - each cell is generally only good for hundreds to thousands of writes, then it fails

Normal filesystems have hot spots, places where there are lots of write accesses
- for example, inodes, especially the accessed timestamp

So, for any sort of flash storage to be practical, we have to smooth out write hotspots
- can't write to the same area of flash repeatedly

Solution: every write goes to a new set of storage cells
- this is "wear leveling"

But what if I want to keep writing to the same block?
- abstraction layer: writing to block 2000 repeatedly actually goes to many different storage cells
- the drive keeps track of which cells currently hold block 2000, and returns those when it is requested

So, there is massive duplication of repeatedly written files in SSDs
- and there is background garbage collection to get rid of old versions

Writing always to new places, basically always "appending" to the disk, is what a "log-structured filesystem" does
- makes writes fast at the expense of slower reads, unless you use lots of cache

Haystack is like a log-structured filesystem, kind of
- whole files are written, so it doesn't have the fragmentation issue of regular log-structured filesystems

Why "warm" storage?
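The wear-leveling indirection described above can be sketched as a toy flash translation layer (purely illustrative, not any real SSD firmware): every write of a logical block goes to a fresh physical page, and a mapping table tracks which page currently holds each block.

```python
class ToyFTL:
    """Toy flash translation layer: append-only writes plus a mapping table."""

    def __init__(self):
        self.flash = []          # physical pages, only ever appended to
        self.mapping = {}        # logical block number -> physical page index

    def write(self, block, data):
        self.flash.append(data)              # never overwrite old cells
        self.mapping[block] = len(self.flash) - 1
        # pages no longer referenced by the mapping are stale duplicates,
        # reclaimed later by background garbage collection

    def read(self, block):
        # the drive returns whatever page currently holds this block
        return self.flash[self.mapping[block]]

ftl = ToyFTL()
ftl.write(2000, b"v1")
ftl.write(2000, b"v2")           # "block 2000" again, but new storage cells
print(ftl.read(2000))            # the old b"v1" page is now garbage
```

Note how writing the "same" block twice consumed two physical pages; this is the duplication that garbage collection must clean up, and it is exactly the append-everywhere behavior of a log-structured filesystem.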
- hot is stuff that is being frequently accessed
- cold is infrequently accessed
- so warm is in between
  - been around a while; should be pretty fast, but doesn't need to be the fastest
  - can optimize for durability and space efficiency

Want the same kind of guarantees we get with lots of replicas, but with fewer replicas
- hence, erasure coding

Same idea as RAID-5, but more sophisticated
- can lose a disk but still reconstruct the data (at reduced performance)
- trade computation for storage space
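The simplest form of this trade-off is RAID-5's single XOR parity block (F4 actually uses more sophisticated Reed-Solomon codes); the sketch below, with invented data, shows how losing any one "disk" still lets us reconstruct its contents by recomputing the XOR.

```python
def xor_blocks(a, b):
    """XOR two equal-length byte strings together."""
    return bytes(x ^ y for x, y in zip(a, b))

# Three data "disks" holding one block each (made-up example bytes).
data_disks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]

# The parity disk holds the XOR of all data blocks.
parity = data_disks[0]
for block in data_disks[1:]:
    parity = xor_blocks(parity, block)

# Suppose disk 1 fails: XOR the parity with the surviving blocks
# to rebuild the lost data (this is the extra computation we pay).
recovered = parity
for i, block in enumerate(data_disks):
    if i != 1:
        recovered = xor_blocks(recovered, block)

print(recovered == data_disks[1])   # -> True
```

Storing one parity block per three data blocks costs 33% overhead instead of the 200% overhead of keeping three full replicas, which is why erasure coding fits warm storage where some recovery latency is acceptable.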