DistOS 2023W 2023-03-13

==Discussion Questions==
* List all the terms and acronyms in the Ceph paper and discuss their meaning and relationship with each other.
* To what degree is Ceph POSIX compliant? Is there a cost for this?
* Discuss Figures 1-3 in CRUSH, what does each say?
==Notes==
<pre>
Ceph & CRUSH
------------
The big insight of this work is that there are different approaches to managing metadata in distributed filesystems
Is metadata a problem in single-device filesystems?
In UNIX, there are three timestamps associated with every inode
- data modification time (mtime)
- inode modification time (ctime)
- data access time (atime)  <----
The access time is the weird one, because it turns every read access into a write to the inode
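
A quick illustration (mine, not from the lecture) of reading the three
timestamps with Python's os.stat; the file path is just an example:

# Print the three inode timestamps of an existing file.
import os
import time

st = os.stat("/etc/hostname")              # any existing file will do
print("atime:", time.ctime(st.st_atime))   # data access time
print("mtime:", time.ctime(st.st_mtime))   # data modification time
print("ctime:", time.ctime(st.st_ctime))   # inode (metadata) change time

# Reading the file normally bumps st_atime (unless the filesystem is
# mounted with noatime/relatime), so every read can turn into a write.
with open("/etc/hostname") as f:
    f.read()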
This wasn't a problem with traditional mechanical (magnetic) hard drives, but with SSDs it is a big problem
- because writes are much more expensive and wear out the device
  (hence wear leveling)
the "noatime" option means never update the access timestamp, so it becomes invalid
"relatime" means update access timestamp only occassionally (once a day) and to make sure it is always equal or newer than the modified timestamp
If you have lots of small files, on some filesystems you can get to a point where you can't create new files even though there are free blocks
- because you've run out of pre-allocated inodes, and the filesystem doesn't
  support dynamically adding them
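
This is easy to see with statvfs: free data blocks and free inodes are
tracked separately. A small illustrative check in Python (Unix-only):

import os

vfs = os.statvfs("/")   # any mount point works
print("free blocks:", vfs.f_bavail, "of", vfs.f_blocks)
print("free inodes:", vfs.f_favail, "of", vfs.f_files)
# If f_favail reaches 0 while f_bavail is still large, new files cannot
# be created even though plenty of free space remains.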
On mechanical hard drives, inode access always slowed things down because it normally caused the drive head to seek between the inode location and the data blocks
  - this is why inode data is aggressively cached, even though caching can corrupt the filesystem if the system crashes before cached metadata is written back
    (journaled filesystems exist to make inode caching more reliable)
How does GFS deal with metadata?
- centralized in the master node
- trade-off is poor support for small files
Ceph wants to be POSIX compliant
- so it must support arbitrary files, both small and large
- standard file semantics
Ceph directly addresses the metadata problem in a few ways
- a metadata server cluster, to spread the load and ensure reliability
- CRUSH - removes the need for clients to talk to metadata servers
  for every file update
    - otherwise clients would be slowed down, because regular file operations
      would have to go through both the metadata servers and the OSDs
- dynamic subtree partitioning - change which metadata servers are responsible for which filesystem subtrees based on load
So CRUSH is essentially a hash function for determining where data is placed
  - files are striped into objects, and CRUSH maps those objects (grouped into placement groups) onto OSDs
The naive way of doing this would be a standard hash function, as in a hash table
- this is essentially what CRUSH's "uniform" bucket type does
- the problem with this is that any change to the cluster structure could require almost all data to be moved
- so CRUSH has other placement strategies (bucket types), such as the binary tree one,
  where placements are much more stable when devices are added or removed
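
A toy comparison (my own sketch, not the actual CRUSH algorithm) of how much
data moves when one OSD is added: naive modulo placement vs a rendezvous-style
"longest straw" placement, which is the idea behind CRUSH's straw buckets
(real CRUSH also weights devices and works over a hierarchy of buckets).

import hashlib

def h(*parts):
    # Deterministic hash of the arguments (independent of Python's hash seed).
    data = "/".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def naive_place(obj, n_osds):
    # Hash-table style placement: hash(obj) mod cluster size.
    return h(obj) % n_osds

def straw_place(obj, n_osds):
    # Every OSD "draws a straw" from hash(obj, osd); the longest straw wins.
    return max(range(n_osds), key=lambda osd: h(obj, osd))

objs = [f"obj{i}" for i in range(10000)]
for place in (naive_place, straw_place):
    before = {o: place(o, 10) for o in objs}
    after  = {o: place(o, 11) for o in objs}   # grow the cluster by one OSD
    moved = sum(before[o] != after[o] for o in objs)
    print(place.__name__, "moved", moved, "of", len(objs), "objects")
# Naive placement reshuffles roughly 90% of the objects; the straw-style
# placement moves only about 1/11 of them (those the new OSD should own).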
</pre>
