Operating Systems 2019F Lecture 16


Note that this lecture was given on Monday rather than Wednesday. Please don't come to class for lecture on Wednesday, Nov. 6; instead, watch the video below!

Video

The video from the lecture given on November 4th, 2019 (special time) is now available.

Notes

Lecture 16
----------

Filesystem implementation

* worth looking at the textbook

One view of a filesystem is as an implementation of the open, read, write, close, etc. system calls.
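
As a reminder of what that interface looks like from user space, here is a minimal C sketch that copies one file to another using only these calls (the path names are just examples, and error handling is abbreviated):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        /* the filesystem implements these calls; paths are examples */
        int in  = open("input.txt", O_RDONLY);
        int out = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0)
            return 1;

        while ((n = read(in, buf, sizeof(buf))) > 0)
            write(out, buf, (size_t) n);

        close(in);
        close(out);
        return 0;
    }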

But also, with persistent storage, a filesystem is a data structure (on persistent media of some kind)

With standard data structures in volatile memory (ones in process memory), corruption isn't a big deal
 - just restart the process

Persistent data structures have to maintain integrity for a long time
 - much more likely to see corruption
 - can't "restart" - that would result in lost data
   (a restart is a reformat)

Key design constraints for filesystems
 - store it in blocks (not bytes)
   - address storage with block numbers and offsets within blocks, not byte addresses (see the sketch after this list)
 - robust in the face of (accidental) corruption
 - random access is high latency (not true for SSDs)
 - sequential access is much lower latency
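
To make the block addressing concrete: translating a byte offset into a (block number, offset within block) pair is just integer arithmetic. A sketch assuming a hypothetical 4 KiB block size:

    #include <stdio.h>

    #define BLOCK_SIZE 4096  /* hypothetical; real filesystems vary */

    int main(void)
    {
        long byte_offset = 10000;  /* arbitrary example offset */

        /* filesystems address storage by block, not by byte */
        long block_number    = byte_offset / BLOCK_SIZE;
        long offset_in_block = byte_offset % BLOCK_SIZE;

        printf("byte %ld -> block %ld, offset %ld\n",
               byte_offset, block_number, offset_in_block);
        return 0;
    }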

UNIX filesystems have inodes
 - to allow multiple hard links, so filenames refer to inodes, and inodes refer to data
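
You can observe this indirection with stat(2): every hard link to a file reports the same inode number, and st_nlink counts the links. A small sketch (the path is just an example):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat sb;

        if (stat("somefile", &sb) != 0)  /* example path */
            return 1;

        printf("inode: %lu, hard links: %lu\n",
               (unsigned long) sb.st_ino,
               (unsigned long) sb.st_nlink);
        return 0;
    }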

Many, many UNIX filesystem implementations
 - ext4 is most common on Linux systems, but xfs is also used
 - bsd systems traditionally used ufs
 - btrfs, zfs are also important as next-generation filesystems

Basic organization of UNIX-like filesystems:
 * data blocks
 * inode blocks
   - metadata (timestamps, length, owner, perms, etc)
   - data or references to data blocks
 * directory blocks (may be a type of inode block)
 * superblock(s)

Normally the divisions between block types are defined at filesystem creation/formatting time (e.g., by mkfs).

Just as files have metadata, so do filesystems.
The superblock stores metadata on the filesystem
 - needed for when it is mounted
 - says how big it is, what type, other filesystem-specific parameters
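
As a rough illustration, a superblock might contain fields like the following; this layout is invented for the example and does not match ext4, ufs, or any real filesystem:

    #include <stdint.h>

    /* hypothetical on-disk superblock; all fields are illustrative */
    struct superblock {
        uint32_t magic;             /* identifies the filesystem type */
        uint32_t block_size;        /* e.g., 4096 bytes */
        uint64_t total_blocks;      /* how big the filesystem is */
        uint64_t free_blocks;       /* updated as state changes */
        uint64_t total_inodes;
        uint64_t free_inodes;       /* also changes over time */
        uint64_t inode_table_start; /* where inode blocks begin */
        uint64_t data_start;        /* where data blocks begin */
    };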

Normally the superblock is cached in RAM
 - will be updated periodically as filesystem state changes (e.g., number
   of free inodes)
 - will be written to disk periodically
 - most data is static

Superblock is normally the first block of a filesystem
 - convenient!
 - but...what if it gets corrupted?  You can't access the filesystem
 - but fortunately we have backups as part of the filesystem, at well-known
   offsets (which can vary by filesystem)

When a filesystem gets corrupted, what do you do?  you check it!
 - use fsck (file system check)

If fsck has to repair the filesystem because of errors, it could make things worse.
  - error recovery procedures may cause loss of data

If you really care about the data and don't have a current backup, do a low level (block level) copy of the filesystem *before* you run fsck.

You really want backups of your OpenStack instances!

The people who really understand filesystem structure are forensic analysts
 - helps them find deleted data

On most UNIX filesystems, delete is just unlink
 - so an inode is reclaimed when there are no hard links to it
 - but being reclaimed does not mean it is erased; it is just added to the pool of inodes that can be overwritten
 - same for data blocks
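
In C, deleting a file is literally a call to unlink(2). A sketch (example path only):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* removes one name; the inode and its data blocks are only
           reclaimed when the link count hits zero (and no process
           still has the file open), and reclaimed does not mean
           erased: the old bits sit in the free pool until reused */
        if (unlink("somefile") != 0) {  /* example path */
            perror("unlink");
            return 1;
        }
        return 0;
    }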

fsck on large disks
-------------------
 * fsck involves walking the filesystem data structure looking for inconsistencies
   - data blocks referred to by multiple inodes
   - deleted inodes that still have names
   - etc.
 * fsck thus involves a lot of random disk access

Older systems would tell you not to just turn off the computer...why?
 - late 1990s
 - because filesystem state was cached in RAM and would be lost
   - write caching
 - so filesystem on disk would be left in an inconsistent state,
   would need to run fsck to potentially repair
 - fsck times got longer and longer (10+ minutes), hours for RAID arrays

RAID = redundant array of inexpensive disks (the original expansion): ways to combine
       multiple disks into one logical disk with higher performance (raid-0,
       striping), better reliability (raid-1, mirroring), or a mix of the two
       (raid-5 and higher)
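
To make raid-0 concrete: striping spreads consecutive logical blocks across disks round-robin, so large sequential transfers hit all disks in parallel. A sketch assuming a hypothetical 4-disk array:

    #include <stdio.h>

    #define NUM_DISKS 4  /* hypothetical array size */

    int main(void)
    {
        for (long block = 0; block < 8; block++) {
            /* raid-0: consecutive blocks alternate across disks */
            long disk          = block % NUM_DISKS;
            long block_on_disk = block / NUM_DISKS;
            printf("logical block %ld -> disk %ld, block %ld\n",
                   block, disk, block_on_disk);
        }
        return 0;
    }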

So, how do you avoid long fsck times?
 - write everything twice
 - write first to a journal: sequential record of activity, changes to fs
 - write later to main filesystem data structures (random access)
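
A minimal sketch of the write-twice idea; the types and helpers here are invented for illustration and are nothing like a real journaling implementation (e.g., ext4's jbd2):

    #include <stdio.h>

    struct fs_change { long block; char data[64]; };  /* hypothetical */

    static void journal_append(const struct fs_change *c)
    {
        /* sequential write: append a record describing the change */
        printf("journal: record for block %ld\n", c->block);
    }

    static void journal_commit(void)
    {
        /* force the journal record to disk (an fsync-like barrier) */
        printf("journal: commit\n");
    }

    static void fs_apply(const struct fs_change *c)
    {
        /* random-access write to the main filesystem structures;
           if we crash before this finishes, recovery replays the
           journal and redoes it */
        printf("fs: apply block %ld\n", c->block);
    }

    int main(void)
    {
        struct fs_change c = { 42, "new contents" };
        journal_append(&c);
        journal_commit();
        fs_apply(&c);
        return 0;
    }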

With a journal, on fsck you just have to check the journal
 - any inconsistencies will be there
 - doesn't handle inconsistencies in "old" data

next-generation filesystems have integrity protection for all data and metadata
 - zfs & btrfs
 - will catch silent corruption

log-structured filesystems
 - writing twice seems silly, especially if you are writing a lot
 - idea - why not make the journal the entire filesystem?
 - the log will fill up, so you periodically clean it (write out new entries for up-to-date data and mark old ones as deleted)
 - originally developed for systems with high write load, such as online databases
 - but today they are used in solid-state disks

Log-structured "filesystems" aren't directly accessible in an SSD
 - they are below the block interface level
 - needed because repeated writes to the same portion of an SSD will wear it out
 - the log structure is used for wear leveling
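
A toy sketch of the idea: every write appends to the log, and a mapping table tracks where the latest copy of each logical block lives. All names and sizes are invented for illustration:

    #include <stdio.h>

    #define NUM_LOGICAL 8  /* hypothetical number of logical blocks */

    static long mapping[NUM_LOGICAL]; /* logical block -> log position */
    static long log_head = 0;         /* next free slot in the log */

    static void log_write(long logical_block)
    {
        /* never overwrite in place: append, then update the map;
           old copies become garbage for the cleaner, and spreading
           writes across the medium is exactly wear leveling */
        mapping[logical_block] = log_head++;
    }

    int main(void)
    {
        for (long i = 0; i < NUM_LOGICAL; i++)
            mapping[i] = -1;  /* -1 means "never written" */

        log_write(3);
        log_write(3);  /* the rewrite lands at a new log position */

        printf("logical block 3 now lives at log position %ld\n",
               mapping[3]);
        return 0;
    }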

Holes in files in UNIX
 - if a file is all zeros, why allocate data blocks for those zeros?
 - if you just write a block of zeros, it will be allocated
 - but if you lseek past the end of a file and then write, the portions you skipped over will be filled in with virtual zeros (zero data that isn't stored)
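
A sketch that creates a file with a hole: seek far past the end, write one byte, and the skipped region costs no data blocks. The path and offset are arbitrary examples:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("sparse.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        lseek(fd, 1024 * 1024, SEEK_SET);  /* seek 1 MiB past the end */
        write(fd, "x", 1);                 /* file is now 1 MiB + 1 */
        close(fd);

        struct stat sb;
        if (stat("sparse.dat", &sb) != 0)
            return 1;

        /* st_blocks counts 512-byte units actually allocated; for a
           sparse file this is far smaller than st_size suggests */
        printf("size: %lld bytes, allocated: %lld bytes\n",
               (long long) sb.st_size,
               (long long) sb.st_blocks * 512);
        return 0;
    }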