DistOS 2015W Session 12
- Beaver et al., "Finding a needle in Haystack: Facebook’s photo storage" (OSDI 2010): http://static.usenix.org/legacy/events/osdi10/tech/full_papers/Beaver.pdf
- Gordon et al., "COMET: Code Offload by Migrating Execution Transparently" (OSDI 2012): https://www.usenix.org/conference/osdi12/technical-sessions/presentation/gordon
- Muralidhar et al., "f4: Facebook's Warm BLOB Storage System" (OSDI 2014): https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar
- Zhang et al., "Customizable and Extensible Deployment for Mobile/Cloud Applications" (OSDI 2014): https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhang
Session 12 had two separate themes running through its papers.

The first was the storage of large volumes of data that would never be modified, rarely be deleted, and read with varying frequency distributions. This is a specific sub-problem of the more general challenge of high-performance, scalable, and reliable distributed storage, and it once more led the solvers at hand (Facebook) to design specialized systems that exploit the specifics of the sub-problem for superior performance.

The second was a return to code-offloading mechanisms that can turn ordinary programs into distributed ones, in a modern context. COMET takes existing smartphone applications and splits the user interaction and computation between the phone and an external server. Sapphire provides a whole collection of deployment modules that one can mix and match and apply to conformant programs to turn them into distributed applications.
Haystack
- Facebook's Photo Application Storage System.
- Previous FB photo storage was based on an NFS design. NFS didn't work well because it took 3 file-system accesses per logical photo read; Haystack needs only one (a sketch of this lookup follows this list).
- Main goals of Haystack:
- High throughput with low latency. Haystack achieves these by using at most one disk operation per photo read.
- Fault tolerance
- Cost effective
- Simple
- Facebook stores all images in Haystack with a CDN in front to cache hot data. Haystack still needs to be fast, since accessing non-cached data is still common.
- Haystack reduces the memory used for filesystem metadata
- It has 2 types of metadata:
- Application metadata
- File System metadata
- The architecture consists of 3 components:
- Haystack Store
- Haystack Directory
- Haystack Cache
- Pitchfork and bulk sync are used to tolerate faults: pitchfork periodically checks the health of Store machines, and bulk sync restores a failed machine's data from a replica. This fault tolerance is essential to making Haystack feasible and reliable.
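Below is a minimal Python sketch (purely illustrative, not Facebook's implementation) of the single-disk-read idea: the Store keeps an in-memory index from photo key to (offset, size) within a large append-only volume file, so a read needs one in-memory lookup plus one disk access, with no per-file filesystem metadata lookups. The class and file names are made up for illustration.

```python
import os

class HaystackVolume:
    """Sketch of a Haystack Store volume: many photos in one big file."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # photo_key -> (offset, size), kept entirely in RAM

    def put(self, photo_key, data):
        # Append the "needle" to the volume file and remember where it went.
        offset = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[photo_key] = (offset, len(data))

    def get(self, photo_key):
        # One in-memory lookup, then a single seek + read on disk.
        offset, size = self.index[photo_key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)

# Usage sketch: write a photo, then read it back with one disk access.
vol = HaystackVolume("/tmp/haystack_volume.dat")
vol.put("photo:42", b"...jpeg bytes...")
assert vol.get("photo:42") == b"...jpeg bytes..."
```

In the real system, the Directory maps photos to logical volumes and the Cache shields the Store from requests for recently written or hot photos; this sketch only shows the Store's read and write path, and its index is not persisted.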
Comet
- Builds on the concept of distributed shared memory (DSM). In a DSM, the RAM of multiple machines appears as if it all belongs to one machine, allowing better scalability for caching.
- DSM provides advantages over RPC (Remote Procedure Call), including multi-threading support and thread migration during execution.
- The client and server maintain consistency using DSM.
- COMET works by offloading a computation-intensive thread from the mobile device to a single server.
- Offloading works by passing the computation-intensive thread's state to the server and pausing it on the mobile device. Once execution on the server completes, the results and control are returned to the mobile device. In other words, the thread is not permanently moved to the server; it runs on the server while its local copy is paused on the phone (see the sketch after this list).
- In the Java Memory Model, memory reads and writes are partially ordered by a transitive "happens-before" relationship.
- The Java virtual machine can track this data flow, which the DSM uses to keep the heap, stacks, and locking state consistent across endpoints.
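The following is a hypothetical Python sketch of the offload flow described above. COMET itself works inside the Dalvik VM with field-level distributed shared memory, so the function names and the pickle-based "transfer" here are stand-ins rather than the paper's API.

```python
import pickle

def run_on_server(func, state):
    # Stand-in for shipping the thread's state to the server and executing
    # there; in this sketch it simply runs locally for illustration.
    payload = pickle.dumps(state)        # state that would cross the network
    remote_state = pickle.loads(payload)
    return func(remote_state)

def expensive_filter(state):
    # Pretend this is a CPU-heavy operation the phone wants to offload.
    return {"pixels": [p * 2 for p in state["pixels"]], "done": True}

def handle_user_action(state):
    # Phone side: the local thread pauses while the server runs, then the
    # returned results are merged back into the local heap and execution
    # resumes on the device.
    result = run_on_server(expensive_filter, state)
    state.update(result)
    return state

print(handle_user_action({"pixels": [1, 2, 3], "done": False}))
```

The key design point this illustrates is that the application is unmodified: the decision to run `expensive_filter` remotely is transparent to the rest of the program, which just sees the updated state when the thread resumes.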
F4
- Warm Blob Storage System.
- Warm Blob is a store for large quantities of immutable data that isn't frequently accessed, but must still be available.
- Built to reduce the overhead of Haystack for old data that doesn't need to be quite as available. Generally, data that is a few months old is moved from Haystack to f4.
- f4 reduces the space usage of Haystack, lowering the effective replication factor from 3.6 to 2.8 or 2.1 using Reed-Solomon coding and XOR coding respectively, while still maintaining reliability.
- Less robust to data center failures as a result.
- Reed-Solomon coding uses (10,4): 10 data blocks and 4 parity blocks per stripe, so it can tolerate losing up to 4 blocks (and hence up to 4 rack failures) with an expansion factor of 1.4. Two copies of this give an effective replication factor of 2 * 1.4 = 2.8.
- XOR coding uses (2,1) across three data centers with an expansion factor of 1.5, which gives an effective replication factor of 1.5 * 1.4 = 2.1 (the arithmetic is worked in the sketch after this list).
- The caching mechanism reduces load on the storage system and helps make BLOB storage scalable.
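The replication-factor figures above can be checked with a quick calculation; a small Python sketch of the arithmetic, using the numbers quoted from the f4 paper:

```python
# Effective replication factors for f4's coding schemes.
n_data, n_parity = 10, 4                      # Reed-Solomon (10,4) stripe
rs_expansion = (n_data + n_parity) / n_data   # 14/10 = 1.4

two_copies = 2 * rs_expansion                 # two RS-coded copies in two DCs
print(round(two_copies, 1))                   # 2.8 effective replication

xor_expansion = 1.5                           # XOR (2,1) across three DCs
xor_scheme = xor_expansion * rs_expansion
print(round(xor_scheme, 1))                   # 2.1 effective replication
```

Both are well under Haystack's 3.6, which is the whole point of moving warm data into f4.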
Sapphire
- Represents a building block towards a global distributed system. The main critique is that the paper did not present a specific use case around which the design was built.
- Sapphire does not show its scalability boundaries. No distributed system model can be “one size fits all”; it will most probably break in some large-scale distributed application.
- Reaching a global distributed system that addresses all the distributed OS use cases will be the cumulative work of many groups building it block by block; such a system will evolve by putting all of these different building blocks together. In other words, a global distributed system will come from a “bottom up not top down approach” [Somayaji, 2015].
- The concept of separating application logic from deployment logic helps programmers build a flexible system. The other important aspect that makes it scalable is that it is object-based and could be integrated with any object-oriented language (a sketch of the separation follows).
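A hypothetical Python sketch of the "separate application logic from deployment logic" idea: the application object below contains no distribution code, and a made-up deployment manager wraps it to add caching without changing the application logic. The names and interface are illustrative, not Sapphire's actual kernel or deployment-manager API.

```python
class UserProfile:
    """Pure application logic: no distribution or caching code."""

    def __init__(self):
        self.friends = []

    def add_friend(self, name):
        self.friends.append(name)

    def get_friends(self):
        return list(self.friends)


class CachingDeploymentManager:
    """Deployment logic: caches reads and forwards writes to the wrapped object."""

    def __init__(self, app_object):
        self._remote = app_object   # in Sapphire this could live on another node
        self._cache = None

    def add_friend(self, name):
        self._remote.add_friend(name)   # writes go through and invalidate the cache
        self._cache = None

    def get_friends(self):
        if self._cache is None:         # reads served from the cache when possible
            self._cache = self._remote.get_friends()
        return self._cache


# Usage: the same application object can be redeployed with a different
# deployment manager (replication, code offload, etc.) without being edited.
profile = CachingDeploymentManager(UserProfile())
profile.add_friend("alice")
print(profile.get_friends())
```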