DistOS 2021F 2021-11-04

Notes

Lecture 14
----------

BLOBs (binary large objects)
 - photos, videos, or other binary data
 - immutable

traditional filesystems have to support reading & writing at any time
 - here, we just need creation and reading

Haystack
 - what's the problem that it solved?
   - making photo storage more efficient
      - higher density
      - more requests per second

How do you get faster?  Do less work
 - reduce disk operations by reducing what has to be read
 - they benchmarked and noticed metadata lookups were taking a lot of time
   - because you first have to read the metadata from disk, then read the data

Idea: keep a cache of metadata in RAM, so we can just do data I/O
 - to do this, need to minimize metadata
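
A minimal sketch of the idea in Python (the index entry here is made up):
keep name -> (offset, size) in a RAM dict, so the metadata lookup costs no
disk I/O and each read is a single os.pread() on a POSIX system:

    import os

    # In-RAM metadata cache: file name -> (byte offset, size).
    # In a traditional filesystem, finding this would itself cost disk I/O.
    index = {"photo123.jpg": (4096, 1_048_576)}   # made-up entry

    def read_blob(fd, name):
        offset, size = index[name]         # metadata lookup: RAM only
        return os.pread(fd, size, offset)  # one disk operation for the data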

Why do regular filesystems have so much metadata?
(Does anyone know how ISO9660/CD-ROM/DVD/etc is organized?)

Traditional filesystems have to allow files to grow and shrink arbitrarily
 - have to add or remove blocks from a file
   - removed blocks need to be allocated to new files

optical media filesystems are designed to be read only
 - so the layout can be optimized for reading, since files never change

regular filesystems are optimized by organizing files into ranges of blocks
 - want each file to use as few ranges (extents) as possible, but those
   sequences of blocks can be anywhere on disk

With a read-only filesystem, we can just lay out files one at a time, each getting as many blocks as it needs in sequence
 - file a is blocks 2-50, file b is blocks 51-1000, etc.
Note this GREATLY reduces size of metadata
 - because metadata includes the list of blocks
 - what's the most compact way to represent a list of blocks?  a range

an arbitrary list like 1, 5, 100, 2, ... vs. a single range like 2-10

remember with BLOBs we're talking about large files
  - megabytes and up
  - blocks are normally 4 KiB
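
Back-of-the-envelope, with made-up but typical numbers: a 100 MiB blob in
4 KiB blocks is 25,600 blocks, so an explicit block list dwarfs a single
(start, length) extent:

    blob_size  = 100 * 2**20              # 100 MiB blob
    block_size = 4 * 2**10                # 4 KiB blocks
    n_blocks   = blob_size // block_size  # 25,600 blocks

    list_bytes   = n_blocks * 8   # block list at 8 bytes per entry
    extent_bytes = 16             # one (start, length) pair

    print(n_blocks, list_bytes, extent_bytes)  # 25600 204800 16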

With haystack, we have a big haystack with lots of needles
 - a needle is just a file (data + metadata), but stored sequentially
   (metadata first, then data)

In memory, I just need the name of the file and its starting and ending blocks
 - from that I get all the data at once
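
A toy needle store to make this concrete (the header format is invented,
not Haystack's real layout): each needle is appended as a small metadata
header followed by its data, and an in-memory index records where each
file lives:

    import struct

    HEADER = struct.Struct("<Q")  # invented header: just the data length

    class Haystack:
        def __init__(self, path):
            self.f = open(path, "a+b")
            self.index = {}       # in RAM: name -> (offset, length)

        def append(self, name, data):
            self.f.seek(0, 2)                     # jump to end of file
            offset = self.f.tell()
            self.f.write(HEADER.pack(len(data)))  # metadata first...
            self.f.write(data)                    # ...then the data
            self.index[name] = (offset, len(data))

        def read(self, name):
            offset, length = self.index[name]     # metadata from RAM
            self.f.seek(offset + HEADER.size)
            return self.f.read(length)            # one sequential read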

What happens when I delete a needle (a file)?
 - fragmentation, i.e., unused storage

Facebook would mark a file as being deleted, but it wouldn't delete it
(and space would only be reclaimed when the entire haystack was compacted)
 - could stick around for a long time
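
Continuing the toy sketch above: deletion only records a tombstone, and
the bytes come back only when the whole haystack is compacted by copying
live needles into a fresh file:

    deleted = set()   # tombstones: names marked deleted, bytes still on disk

    def delete(name):
        deleted.add(name)                  # mark only, no space reclaimed

    def compact(old, new_path):
        # Space is reclaimed only here: rewrite the live needles and
        # drop the old haystack file afterwards.
        new = Haystack(new_path)
        for name in old.index:
            if name not in deleted:
                new.append(name, old.read(name))
        return new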

A key goal for F4 was to facilitate quick data deletion
 - so each BLOB gets its own encryption key, and delete => delete the key
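
One way to sketch that (using the third-party cryptography package; F4's
actual scheme differs in detail): encrypt each BLOB under its own key, and
make delete mean "destroy the key", so any lingering ciphertext is
unreadable:

    from cryptography.fernet import Fernet

    keys  = {}   # per-BLOB keys, stored separately from the data
    store = {}   # stands in for the on-disk BLOB storage

    def put(name, data):
        key = Fernet.generate_key()              # one key per BLOB
        keys[name] = key
        store[name] = Fernet(key).encrypt(data)  # only ciphertext is stored

    def get(name):
        return Fernet(keys[name]).decrypt(store[name])

    def delete(name):
        del keys[name]  # ciphertext may linger, but is now unrecoverable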

Diversion: How do SSDs work?
 - randomly accessible array of blocks (interface)
 - looks like a hard disk mostly, except there's one problem
    - writes damage individual storage cells
     - each cell is generally only good for hundreds to thousands of
       writes before it fails

Normal filesystems have hot spots, places that get lots of write accesses
 - for example, inodes, especially the last-accessed timestamp

So, for any sort of flash storage to be practical, we have to smooth out write hotspots
 - can't write to the same area of flash repeatedly

Solution: every write goes to a new set of storage cells
 - this is "wear leveling"

But what if I want to keep writing to the same block?
 - abstraction level: writing to block 2000 repeatedly actually goes
   to many different storage cells
      - the drive keeps track of which cells currently hold block 2000
        and returns their contents when the block is requested
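
A greatly simplified flash translation layer in Python: every write of
logical block 2000 lands on a fresh page, and a map remembers which page
is current (stale copies pile up until garbage collection):

    flash = []     # physical pages, only ever appended, never overwritten
    current = {}   # logical block number -> index of its live page

    def write_block(lbn, data):
        flash.append(data)             # always write to fresh cells
        current[lbn] = len(flash) - 1  # this copy is now "block lbn"

    def read_block(lbn):
        return flash[current[lbn]]     # return whichever copy is current

    for i in range(3):
        write_block(2000, b"version %d" % i)
    print(read_block(2000))  # b'version 2'; two stale copies await GC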

So, there is massive duplication of repeatedly written blocks in SSDs.
 - and there is background garbage collection to get rid of old ones


Always writing to new places, treating the disk as basically append-only,
is what a "log-structured filesystem" does
 - makes writes fast at the expense of slower reads unless you use lots of cache

haystack is kind of like a log-structured filesystem (whole files are
written at once, so it avoids the fragmentation of regular
log-structured filesystems)

Why "warm" storage?
 - hot is stuff that is being frequently accessed
 - cold is infrequently accessed

so warm is in between
 - data that's been around a while: access should be pretty fast, but it
   doesn't need to be the fastest
    - can optimize for durability and space efficiency

want to have the same kind of guarantees we get with lots of replicas, but with fewer replicas
 - hence, erasure coding

Same idea as RAID-5, but more sophisticated
 - can lose a disk but still reconstruct the data (at reduced performance)

Trade computation for storage space
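
The RAID-5 version of the trade is plain XOR parity (F4 itself uses
Reed-Solomon coding, which generalizes this): one parity block replaces
whole replicas, and a single lost block is rebuilt by XOR-ing the
survivors:

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"disk0", b"disk1", b"disk2"]  # equal-sized data blocks
    parity = xor_blocks(data)              # 1 extra block vs 2 extra replicas

    # Disk 1 dies: recompute its block from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]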