Operating Systems 2021F Lecture 16

From Soma-notes

Video

Video from the lecture given on November 11, 2021 is now available:

Video is also available through Brightspace (Resources->Class zoom meetings->Cloud Recordings tab)

Notes

Lecture 16
----------
 - interviews
    - there will be more, just figuring out my schedule
      for next week
    - yes, final will replace midterm if you do better
 - A3 will be posted by tomorrow, will be going over it a bit today

Today: filesystems

Filesystem
 - collection of files & directories
 - namespace for inode numbers
 - way to transform a block device into a place
   you can store files
 - data structure that is accessed using the file API


From the kernel's perspective
 - has file system call interface
   (open, read, write, etc)
   (also for directories)
 - when it gets a pathname, where does it look to get the data?
   - it figures out which filesystem is responsible for the
     containing directory
   - it then asks that filesystem to do the file operations
 - this abstraction is known as the "VFS" (virtual filesystem) layer

/proc/filesystems lists all the different kinds of filesystems your Linux system knows about currently
 - the ones with "nodev" beside them are ones that have no corresponding storage device
    - means that while you access them with file-related system calls,
      what you're getting back isn't from a storage device; it
      comes from somewhere else
    - these are known as pseudo filesystems
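
On a typical Linux system the output looks something like this (the exact
list varies with the kernel and which modules are loaded):

  cat /proc/filesystems
  nodev   sysfs
  nodev   tmpfs
  nodev   proc
  nodev   devtmpfs
          ext3
          ext4
          squashfs
          vfat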

df: disk free
 - show currently mounted filesystems and how much space each has used/available

mount: add a filesystem to our current file hierarchy
  - mountpoint: where those files should go,
    normally an empty directory
    (if not empty, the existing files are hidden while the filesystem is mounted)

df .
 - tells me the filesystem responsible for the current directory

df -a
 - show ALL the filesystems
   - including ones with no corresponding device
     (pseudo filesystems)
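
As a rough sketch (device names and sizes are illustrative, not from the
class VM; pseudo filesystems show up with 0 blocks since they have no
backing storage):

  $ df .
  Filesystem            1K-blocks    Used Available Use% Mounted on
  /dev/mapper/vg0-lv--0  10255636 4520048   5194772  47% /

  $ df -a | head -4
  Filesystem            1K-blocks    Used Available Use% Mounted on
  sysfs                         0       0         0    - /sys
  proc                          0       0         0    - /proc
  udev                     498472       0    498472   0% /dev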

Pseudo filesystems like /proc tend to have some weird info that gives them away
 - inode numbers are weird
 - file sizes make no sense (are often zero)
 - timestamps aren't consistent

The above is all true because those fields are just made up when you access the files; they aren't "stored" anywhere
 - file metadata for pseudo filesystems isn't that significant
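
For example (sizes, dates, and the byte count here are illustrative; the
point is that the size is reported as 0 even though reading the file
returns data):

  $ ls -l /proc/1/status /proc/uptime
  -r--r--r-- 1 root root 0 Nov 11 10:00 /proc/1/status
  -r--r--r-- 1 root root 0 Nov 11 10:00 /proc/uptime
  $ wc -c < /proc/uptime
  22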

mount
 - get full information on mounted filesystems (with no args)
 - or you can use it to add filesystems to file hierarchy
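
With no arguments it prints one line per mounted filesystem; with a
device and a mountpoint it attaches a new one.  A sketch (device names
and options are illustrative):

  mount | head -3
  sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
  proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
  /dev/mapper/vg0-lv--0 on / type ext4 (rw,relatime)

  # attach /dev/sdb1 (a hypothetical disk partition) at /mnt
  sudo mount /dev/sdb1 /mnt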

mounting a real filesystem
 - have access to data on a block device

mounting a pseudo filesystem
 - have access to some new capability, typically kernel data structures
   or runtime data that doesn't need to be stored on disk
 - essentially all data is in RAM or generated algorithmically
    - just depends on the filesystem type

/run
 - it is a "tmpfs" - temporary filesystem
    - data is temporary
 - no corresponding block device
 - it is a "RAM disk" - full filesystem, but
   *data is lost when system reboots*
 - used for PID files and lock files mainly
    - PIDs will change when system is rebooted
    - locks shouldn't be held across reboot
 - note here "tmpfs" is the filesystem type,
   /run is the mountpoint
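
You can see this in the mount output; something like (options are
illustrative):

  mount | grep ' /run '
  tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)

Note "tmpfs" appears where a device name would normally go, because
there is no device.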

Notice that /tmp is NOT in tmpfs
 - it is part of the root filesystem, which is normally ext4
 - BUT files in /tmp are erased on every reboot
    - but that happens due to a boot time script

If you boot with a live CD
 - data in /run would already be lost
 - data in /tmp should still be there

/var/tmp is like /tmp, except it is NOT erased when rebooted

In your VM, the root filesystem is mounted as follows:

  /dev/mapper/vg0-lv--0 on / type ext4 (rw,relatime)

the "ext4" means that it is of type ext4.

So, why isn't /tmp a tmpfs?
 - historic/distribution reasons (I know some distributions
   that do put /tmp on tmpfs)
 - you can normally store a lot more in /tmp than you could
   in a tmpfs (since tmpfs storage is limited by RAM/virtual memory)

(I'll discuss lock files later)


Different operating systems have their own native filesystems
 - MSDOS: fat, vfat, fat32
 - Windows: NTFS
 - MacOS: HFS, HFS+, APFS
 - FreeBSD: UFS, zfs
 - IRIX: xfs
 - Linux: ext2, ext3, *ext4*, btrfs, squashfs

LOTS of filesystem types

These are all regular filesystems used for regular disks,
originally developed for magnetic hard drives, not SSDs
 - except APFS, which was designed with SSDs in mind

Why so many?
 - some support different file sizes
 - different performance characteristics
 - reliability/durability
 - licensing
 - stubbornness/Not Invented Here

Key difference
 - some are designed for UNIX-like systems (POSIX compliant)
 - others are not!
 - POSIX-compliant ones use inodes, others generally do not

Remember a filesystem is just a data structure
 - so the filesystem type is the kind of data structure

Note that some filesystems have specialized purposes
 - squashfs is designed to be compressed and read only

Why would you want a read only filesystem?
 - storage medium is read only (e.g. optical media)
   iso9660, etc
 - for starting up the system <--- we'll get to this

Look up the YouTube channel "Technology Connections"; it has a whole
series on optical media


So now let's make and use a filesystem

What do we need?
 - a block device

What will we get?
 - files stored on the block device

Challenge
 - we don't have any devices we can physically connect to our VM

Workaround
 - we'll make a file that will behave like a block device

To make an empty file, use dd, e.g.
  dd if=/dev/zero of=fakeblks bs=4096 count=100000

if: input file
of: output file
bs: block size
count: number of blocks to copy

So dd copies count blocks of size bs from the input file to the output file
 - note it does exactly one read system call and one write system call
   for each block transferred
    - in total we read count*bs bytes from if
    - and write count*bs bytes to of

(Note we are reading from /dev/zero, so we are reading from an infinite source of zero bytes)
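
When that dd finishes it reports what it transferred, and the file size
works out to 4096 * 100000 = 409,600,000 bytes (timing line elided;
owner and date below are placeholders):

  100000+0 records in
  100000+0 records out
  409600000 bytes (410 MB, 391 MiB) copied, ...

  ls -l fakeblks
  -rw-rw-r-- 1 student student 409600000 Nov 11 10:05 fakeblks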

Note that you can't just use touch
 - it will just make a file of size zero

You could use truncate
 - but we'll get to that

To look at the contents of a binary file, you can use od
 - with the -a option, translates each byte to its corresponding
   named character
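
On the all-zeroes file we just made, od has very little to say: every
byte is nul, the "*" means "the previous line repeats", and the last
number is the file size as an octal offset:

  od -a fakeblks
  0000000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
  *
  3032400000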

If we run od on the file before and after running mkfs.ext4,
we can see what bytes were modified in the file
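
One way to run that experiment (mkfs.ext4 may warn that fakeblks is not
a block device; /mnt as a mountpoint and the loopback mount are just one
way to do it):

  od -A d -a fakeblks > before.txt    # -A d = decimal offsets
  mkfs.ext4 fakeblks
  od -A d -a fakeblks > after.txt
  diff before.txt after.txt | less    # the byte ranges mkfs wrote

  # to actually use the new filesystem, mount it via a loop device
  sudo mount -o loop fakeblks /mnt
  df /mnt
  sudo umount /mnt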


The kernel has many possibilities in determining how to service a given file operation request
 - if it is for a file on a regular filesystem,
   it uses that filesystem's code to interpret data
   read from or written to the mounted block device

 - if it is for a pseudo filesystem,
   it runs the code in the kernel that implements the file
   operations
     - they can do essentially anything

 - if it is for a character device
    - runs the code for the character device

/dev/zero is a character device that always returns null bytes
 - specifically, when you do a "read" system call, it
   fills the buffer given to it with all zero bytes
 - will do this as many times as asked
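
You can watch this happen: here head does read() calls on /dev/zero and
od shows the bytes that came back (16 nul bytes, ending at octal offset 20):

  head -c 16 /dev/zero | od -a
  0000000 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
  0000020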

There's also /dev/urandom and /dev/random
 - infinite random bytes
    - but /dev/random is VERY slow
      because it tries to return "real" random bytes

A filesystem is a data structure.  So, we need ways to
 - create/initialize the data structure: mkfs
 - validate and ideally repair the data structure: fsck
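
For example, on the fakeblks image from earlier (unmount it first if it
is mounted; output wording varies by e2fsprogs version):

  fsck.ext4 -f fakeblks    # -f forces a full check even if it looks clean

It walks the structure in several passes (inodes and blocks, directory
structure, connectivity, reference counts, group summaries) and reports
or repairs any inconsistencies it finds.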

Why bother validating and repairing filesystem data structures?
 - we don't normally do that for hash tables or binary trees?
   - but they only stick around for short periods of time
   - if they get messed up, we just restart the program

But filesystems are meant to last for years, and hardware & software can fail over that time
 - cosmic rays
 - manufacturing defects
 - code bugs...

So filesystems are designed to be repairable
 - should be able to recover overall structure
 - ideally, preserve what data you can when things are damaged

fsck tries its best to repair filesystems
 - but it relies on how the filesystem is structured
   in order to do its work

Do pseudo filesystems have fsck?
 - no, because it wouldn't make sense,
   no persistent state to fix

note that fsck is specific to each filesystem
 - every data structure needs its own specialized repair mechanisms

What allows filesystems to be repaired?
 - lots of redundancy!

"pointers" in filesystems are always bidirectional
 - so if one is missing we can recover it
 - (like a doubly linked list)

Some data tells us how the rest of the data is organized
 - this is the "superblock"
    (think of it as the root node of a tree)
 - store multiple copies of the superblock because
   if this is lost we lose everything
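
ext4 does exactly this: mkfs.ext4 prints where the backup superblocks
live, and fsck can be told to use one if the primary is damaged (block
numbers below are examples; mkfs prints the real ones for your image):

  # mkfs.ext4 output includes lines like:
  #   Superblock backups stored on blocks:
  #           32768, 98304, 163840, ...
  # if the primary superblock is trashed, repair using a backup:
  fsck.ext4 -b 32768 fakeblks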

Remember filesystems are data structures for organizing blocks
 - a block is fixed size, nowadays generally 4K, but always some small
   power of 2
 - locations in filesystems are in terms of block numbers,
   not addresses
     e.g., block 2000, not address 265101561

We always access the data structure by reading or writing entire
ranges of blocks, not ranges of bytes
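
As a concrete example, with 4K blocks "block 2000" starts at byte
2000 * 4096 = 8,192,000; you can pull out exactly that block from the
fakeblks image with dd:

  dd if=fakeblks bs=4096 skip=2000 count=1 | od -a | head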