DistOS 2023W 2023-03-13
==Discussion Questions==
* List all the terms and acronyms in the Ceph paper and discuss their meaning and relationship with each other.
* To what degree is Ceph POSIX compliant? Is there a cost for this?
* Discuss Figures 1-3 in CRUSH: what does each say?
==Notes==
<pre>
Ceph & CRUSH
------------
The big insight of this work is that there are different approaches to managing metadata in distributed filesystems
Is metadata a problem in single-device filesystems?
In UNIX, there are three timestamps associated with every inode
 - data modification time (mtime)
 - inode modification time (ctime)
 - data access time (atime) <----
The access time is the weird one, because it turns every read access into a write to the inode
This wasn't a big problem with traditional mechanical (spinning, magnetic) hard drives, but with SSDs it is
 - because writes are much more expensive and wear out the device
   (hence wear leveling)
The "noatime" mount option means never update the access timestamp, so it becomes meaningless
"relatime" means update the access timestamp only when it is older than the modification time, or at most about once a day, so it stays roughly meaningful without a write on every read
(a small sketch of these three timestamps follows after these notes)
If you have lots of small files, on some filesystems you can get to a point where you can't create new files even though there are free blocks
 - because you've run out of pre-allocated inodes, and the filesystem doesn't
   support dynamically adding them
On mechanical hard drives, inode accesses always slowed things down because they normally caused the drive head to move between the inode location and the data blocks
 - this is why inode data is aggressively cached, even though caching can leave the filesystem corrupted after a crash
   (journaled filesystems exist largely to make this metadata caching safer)
How does GFS deal with metadata?
 - centralized in the master node
 - the trade-off is no real support for small files
Ceph wants to be POSIX compliant
 - so it has to handle small and large files equally well
 - standard file semantics
Ceph directly addresses the metadata problem in a few ways
 - a separate metadata server cluster to spread the load and ensure reliability
 - CRUSH - removes the need for clients to talk to the metadata servers
   for every file update
   (otherwise regular file operations slow clients down, because they
    would have to talk to both the metadata servers and the OSDs)
 - dynamic subtree partitioning - change which metadata servers are responsible for which filesystem subtrees based on load
So CRUSH is essentially a hash function for determining where data is placed
 - files are striped into objects, and CRUSH maps objects to storage devices (OSDs)
The naive way of doing this would be like a standard hash function for a hash table
 - this is roughly what the "uniform" method of object placement does
 - the problem with this is that any change to the cluster structure could require almost all data to be moved
 - so CRUSH has other placement strategies, such as the binary tree one,
   where placements are much more stable when devices are added or removed
(a small sketch comparing naive placement with a more stable scheme follows after these notes)
</pre>
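A minimal sketch of the three inode timestamps discussed above, assuming a POSIX system and using only Python's standard library; the file name is a throwaway illustration. On a filesystem mounted without noatime/relatime, the plain read in the middle is enough to bump the access time, which is exactly the "read turns into a metadata write" problem noted above.

<pre>
# Sketch: observe the three inode timestamps (mtime, ctime, atime).
# Assumes a POSIX system; "example.txt" is just an illustration file.
# On filesystems mounted with noatime/relatime, the read below may not
# actually update st_atime.
import os
import time

path = "example.txt"

def show(label):
    st = os.stat(path)
    print(f"{label}: atime={st.st_atime:.2f} mtime={st.st_mtime:.2f} ctime={st.st_ctime:.2f}")

with open(path, "w") as f:   # creating/writing data updates mtime (and ctime)
    f.write("hello\n")
show("after write")

time.sleep(1)
with open(path) as f:        # a plain read normally bumps only atime,
    f.read()                 # i.e. a metadata write caused by a read
show("after read ")

time.sleep(1)
os.chmod(path, 0o600)        # an inode-only change updates ctime, not mtime
show("after chmod")
</pre>

noatime and relatime themselves are mount options set per filesystem (for example in the options field of an /etc/fstab entry, or via mount -o remount,relatime), so whether the read above costs a device write depends on how the filesystem was mounted.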
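A minimal sketch of why the naive "hash mod N" placement mentioned above is unstable. This is not Ceph's actual CRUSH algorithm: rendezvous (highest-random-weight) hashing is used here only as a stand-in for a stable placement scheme, and the object and OSD names are made up. Adding one device moves most objects under the naive scheme, but only roughly the fraction that belongs on the new device under the stable one.

<pre>
# Sketch: naive modulo placement vs. a more stable placement scheme.
# NOT CRUSH itself -- rendezvous hashing stands in for CRUSH's stable buckets.
import hashlib

def h(*parts):
    # Stable hash of the joined parts (Python's built-in hash() is not
    # stable across runs, so use md5 for reproducibility).
    return int(hashlib.md5("/".join(parts).encode()).hexdigest(), 16)

def naive_place(obj, osds):
    # Naive placement: hash the object name and take it modulo cluster size.
    return osds[h(obj) % len(osds)]

def rendezvous_place(obj, osds):
    # Stable placement: every OSD "bids" on the object; the highest bid wins,
    # so adding an OSD only steals the objects it now wins.
    return max(osds, key=lambda osd: h(obj, osd))

objects = [f"obj-{i}" for i in range(10_000)]
before  = ["osd-0", "osd-1", "osd-2", "osd-3"]
after   = before + ["osd-4"]        # grow the cluster by one device

for name, place in [("naive", naive_place), ("rendezvous", rendezvous_place)]:
    moved = sum(place(o, before) != place(o, after) for o in objects)
    print(f"{name:10s}: {moved}/{len(objects)} objects moved after adding one OSD")
</pre>

CRUSH gets its stability through its bucket types (the uniform bucket behaves like the naive case when the cluster changes, while the tree and straw buckets handle additions and removals with much less data movement), plus placement groups and replication, none of which this toy sketch models.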