DistOS 2014W Lecture 10
GFS and Ceph (Feb. 4)
- Sanjay Ghemawat et al., "The Google File System" (SOSP 2003)
- Weil et al., Ceph: A Scalable, High-Performance Distributed File System (OSDI 2006).
GFS
GFS is a distributed file system designed specifically for Google's needs. Two assumptions drove its design:

- Most data is written in the form of appends (writes at the end of a file).
- Data is read in a streaming fashion (lots of data read via sequential access).

Because of this, they chose to emphasize performance for sequential access. These two assumptions are also why the chunk size is so large (64 MB): once random access is off the table, you can easily read large blocks, and once data is written it is rarely overwritten in place.
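To make the 64 MB chunk size concrete, here is a minimal sketch of GFS-style chunk addressing. The names and the client/master split implied here are illustrative assumptions, not Google's actual client library:

```python
# Illustrative sketch (not Google's code) of GFS-style chunk addressing.
# Files are split into fixed 64 MB chunks; a client maps a byte offset to
# (chunk index, offset within the chunk), asks the master only for that
# chunk's location, then streams the data straight from a chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the GFS paper

def chunk_coordinates(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A sequential reader crossing chunk boundaries needs the master only once
# per 64 MB of data; everything else is bulk transfer from chunkservers.
for offset in (0, CHUNK_SIZE - 1, CHUNK_SIZE, 5 * CHUNK_SIZE + 42):
    index, within = chunk_coordinates(offset)
    print(f"byte {offset}: chunk {index}, offset {within} within the chunk")
```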
- Very different design because of the workload it is built for:
- Because of the number of small files that have to be indexed for the web, it is no longer practical to have a file system that stores these individually. Too much overhead. Easier to store millions of objects as large files. Punts problem to userspace, incl. record delimitation.
- Don't care about latency
- surprising considering it's Google, the same people who pushed to change the TCP initial window (IW) standard recommendations for latency.
- Mostly scanning (sequentially) through entire files.
- Paper from 2003, mentions still using 100BASE-T links.
- Data-heavy, metadata light (unlike Ceph). Contacting the metadata server is a rare event.
- Consider hardware failures as normal operating conditions:
- uses commodity hardware
- All the replication (!)
- Data checksumming (few file systems do checksums)
- Performance degrades for small random access workload; use other filesystem.
- Path of least resistance to scale, not to do something super CS-smart.
- Google used to re-index every month, swapping out indexes. Now, it's much more online. GFS is now just a layer to support a more dynamic layer.
- The paper seems to lack any mention of security. This FS probably could only exist on a trusted network.
- Implements interface similar to POSIX, but not the full standard.
- create, delete, open, close, read, write
- Unique operations too: snapshot, which is low-cost file duplication, and record append (a toy sketch of this interface follows below)
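A toy, in-memory mock of that narrow interface, to show how record append and snapshot differ from ordinary POSIX writes. All class and method names and the single-process implementation are invented for illustration; real GFS splits this work between a master and chunkservers:

```python
# Toy in-memory mock of a GFS-like interface (illustrative only; real GFS
# involves a master, chunkservers, leases, and replication).

class ToyGFS:
    def __init__(self):
        self.files: dict[str, bytearray] = {}

    def create(self, path: str) -> None:
        self.files[path] = bytearray()

    def write(self, path: str, offset: int, data: bytes) -> None:
        self.files[path][offset:offset + len(data)] = data

    def read(self, path: str, offset: int, length: int) -> bytes:
        return bytes(self.files[path][offset:offset + length])

    def record_append(self, path: str, record: bytes) -> int:
        """Append at an offset the *system* chooses and return that offset,
        so many concurrent producers can share one file without locking."""
        buf = self.files[path]
        offset = len(buf)
        buf.extend(record)
        return offset

    def snapshot(self, src: str, dst: str) -> None:
        """Low-cost duplication: real GFS copies only metadata and shares
        chunks copy-on-write; this toy just copies the buffer."""
        self.files[dst] = bytearray(self.files[src])

fs = ToyGFS()
fs.create("/logs/crawl-0")
at = fs.record_append("/logs/crawl-0", b"url=example.com status=200\n")
fs.snapshot("/logs/crawl-0", "/logs/crawl-0.snap")
print("record landed at offset", at)
```

The key difference from write is that record append lets the system, not the caller, pick the offset, which is what makes many-producer append workloads cheap.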
How other filesystems compare to GFS and Ceph
- Other File Systems: AFS, NFS, Plan 9, traditional Unix
- Data and metadata are held together.
- They did not optimize for different access patterns:
- Data → big, long transfers
- Metadata → small, low latency
- Can't scale separately
- Designed for lower latency
- (Mostly) designed for POSIX semantics
- a product of how the requirements that led to the ‘standard’ evolved
- Assumed that a file is a fraction of the size of a server
- e.g. files on a Unix system were meant to be text files.
- Huge files spread over many servers not even in the cards for NFS
- Meant for small problems, not web-scale
- Google has a copy of the publicly accessible internet
- Their strategy is to copy the internet to index it
- Insane → insane filesystem
- One file may span multiple servers
- Even mainframes (scale-up, ultra-reliable systems with data sets bigger than RAM) don't have the scale of GFS or Ceph.
- Point-to-point access; much less load-balancing, even in AFS
- One server to service multiple clients.
- Single point of entry, single point of failure, bottleneck
- Less focus on fault tolerance
- No notion of data replication.
- Reliability was a property of the host, not the network
Ceph
- Ceph is crazy and tries to do everything
- GFS was designed for a specific, limited scenario under certain conditions, whereas Ceph is a more generic solution for how to build a scalable distributed file system
- Achieves high performance, reliability, and availability through three design features: decoupled data and metadata, dynamically distributed metadata, and reliable autonomic distributed object storage.
- Decoupled data and metadata: metadata operations (open, close) go to the metadata cluster; clients interact directly with OSDs for I/O.
- Distributed metadata: metadata operations make up a large share of the workload. Ceph distributes this workload across many metadata servers (MDSes) that together maintain the file hierarchy.
- Autonomic object storage: OSDs organize amongst themselves, taking advantage of their onboard CPU and memory. Ceph delegates data migration, replication, failure detection, and recovery to the cluster of OSDs.
- Distributed metadata
- Unlike GFS
- Clusters of MDSes.
- Utilizes dynamic subtree partitioning: subtrees of the directory hierarchy are dynamically mapped to MDSes. The workload on every subtree is monitored, and subtrees are reassigned to MDSes accordingly, at a coarse granularity (a toy sketch appears after this list).
- Near-POSIX interface: selectively extends the interface while relaxing consistency semantics.
- e.g. readdirplus is an extension that optimizes a common sequence of operations: readdir followed by multiple stats. It relies on brief caching to improve performance, which may let small concurrent changes go unnoticed.
- Object Storage Devices (OSDs) have some intelligence (unlike GFS), and autonomously distribute the data, rather than being controlled by a master.
- Uses EBOFS (instead of ext3). Implemented in user space to avoid dealing with kernel issues. Aggressively schedules disk writes.
- Uses hashing in the distribution process to uniformly distribute data
- The actual algorithm for distributing data is as follows:
- file + offset → hash(object ID) → CRUSH(placement group) → OSD (a toy version of this pipeline appears below, after this list)
- Each client has knowledge of the entire storage network
- Tracks failure groups (same breaker, switch, etc.), hot data, etc.
- Number of replicas is changeable on the fly, but the placement group is not
- For example, if every client on the planet is accessing the same file, you can scale out for that data.
- You don't ask where to go, you just go, which makes this very scalable
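A toy sketch of the dynamic subtree partitioning idea from the MDS bullets above: per-subtree load is counted, and a hot subtree is handed from the busiest MDS to the least loaded one. The data structures, the top-level-directory granularity, and the greedy rule are invented for illustration; Ceph's real MDS load balancer is considerably more sophisticated:

```python
# Toy illustration of dynamic subtree partitioning across metadata servers.
# Granularity, counters, and the greedy move are invented for illustration.

from collections import defaultdict

class ToyMDSCluster:
    def __init__(self, mds_names):
        self.mds_names = list(mds_names)
        self.owner = {}                 # subtree path -> owning MDS
        self.load = defaultdict(int)    # subtree path -> recent op count

    def record_op(self, path: str) -> None:
        subtree = "/" + path.strip("/").split("/")[0]       # coarse: top-level dir
        self.owner.setdefault(subtree, self.mds_names[0])   # default placement
        self.load[subtree] += 1

    def mds_load(self, mds: str) -> int:
        return sum(n for s, n in self.load.items() if self.owner[s] == mds)

    def rebalance(self) -> None:
        """Greedy step: move the hottest subtree off the busiest MDS."""
        busiest = max(self.mds_names, key=self.mds_load)
        idlest = min(self.mds_names, key=self.mds_load)
        if busiest == idlest:
            return
        owned = [s for s in self.load if self.owner[s] == busiest]
        if owned:
            hottest = max(owned, key=lambda s: self.load[s])
            self.owner[hottest] = idlest

cluster = ToyMDSCluster(["mds0", "mds1"])
for _ in range(1000):
    cluster.record_op("/home/alice/thesis.tex")   # /home becomes hot
for _ in range(10):
    cluster.record_op("/var/log/syslog")
cluster.rebalance()
print(cluster.owner)   # e.g. {'/home': 'mds1', '/var': 'mds0'}
```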
Any distributed file system that aims to be scalable needs to cut down the number of control messages floating around, as opposed to the actual data transfer, which is what Ceph aims to do with the CRUSH function. A client or OSD just needs to know the CRUSH algorithm (and the current cluster map) to compute the location of a file's objects on its own, instead of asking a master server, so it eliminates the traditional file-allocation-list approach.
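A minimal sketch of that placement pipeline (file + offset → object → placement group → OSDs). The hash functions, constants, and the CRUSH stand-in below are simplified inventions for illustration; real CRUSH walks a hierarchical cluster map and keeps replicas in separate failure domains:

```python
# Minimal stand-in for Ceph-style placement: every client runs the same pure
# functions on the same small cluster description, so nobody asks a master
# where data lives. All constants and hashes here are illustrative.

import hashlib

STRIPE_SIZE = 4 * 1024 * 1024   # illustrative object size
PG_COUNT = 128                  # placement groups in the pool
NUM_OSDS = 12
REPLICAS = 3

def _h(s: str) -> int:
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def object_id(inode: int, offset: int) -> str:
    """file + offset -> object: stripe the file into fixed-size objects."""
    return f"{inode:x}.{offset // STRIPE_SIZE:08x}"

def placement_group(oid: str) -> int:
    """object -> placement group, by hashing the object name."""
    return _h(oid) % PG_COUNT

def crush_like(pg: int, replicas: int = REPLICAS) -> list[int]:
    """placement group -> ordered OSD list (toy stand-in for CRUSH):
    rank every OSD by a pseudo-random score derived from (pg, osd)."""
    ranked = sorted(range(NUM_OSDS), key=lambda osd: _h(f"{pg}:{osd}"))
    return ranked[:replicas]

# A client locating byte 10 MB of inode 0x1234 purely by computation:
oid = object_id(0x1234, 10 * 1024 * 1024)
pg = placement_group(oid)
print(oid, "-> pg", pg, "-> osds", crush_like(pg))
```

Because every client and OSD computes the same answer from the same small map, adding or removing OSDs means handing out a new map, not updating a central allocation table.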
- CRUSH is sufficiently advanced to be called magic.
- O(log n) in the size of the data
- CPUs are stupidly fast, so the above is minimal overhead
- the network, despite being fast, has latency, etc.
- Computation scales much better than communication.
- Storage is composed of variable-length atoms
- A very fun video to learn about Ceph and OSD file systems: https://www.youtube.com/watch?v=C3lxGuAWEWU
Class Discussion
File Size
In Anil’s opinion, “how does the file system size compare to the server’s storage size?” is a key parameter that distinguishes the GFS and Ceph designs from the earlier file systems NFS, AFS, and Plan 9. In the early file system designs, the file system was a fraction of the server’s storage size, whereas in GFS and Ceph the file system can be many times larger than any single server.
Segue on drives and sequential access following GFS section
- Structure of GFS does match some other modern systems:
- Hard drives are like parallel tapes, very suited for streaming.
- Flash devices are log-structured too, but have an abstracting firmware.
- They do erasure in bulk, in the background.
- We used to need specialized file systems for MTDs to get better performance; now better micro-controllers in some embedded systems abstract away the hardware.
- Architectures that start big often end up in the smallest things, such as small storage devices.
Lookups vs hashing
One key aspect of the Ceph design is the attempt to replace communication with computation by using the hashing-based mechanism CRUSH. The following line from Anil epitomizes the general approach followed in the field of computer science: “If one abstraction does not work, stick another one in.”