DistOS 2014W Lecture 14: Difference between revisions
Sandy.boon (talk | contribs) |
|||
(6 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
=OceanStore= | =OceanStore= | ||
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/oceanstore-sigplan.pdf John Kubiatowicz et al., "OceanStore: An Architecture for Global-Scale Persistent Storage" (2000)] | |||
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/fast2003-pond.pdf Sean Rhea et al., "Pond: the OceanStore Prototype" (2003)] | |||
* [http://oceanstore.cs.berkeley.edu/info/overview.html Project Overview] | |||
==Keywords== | |||
Highly available, universally available, utility business model, untrusted servers, nomadic data, promiscuous caching, immutable version-based archival storage, highly persistent, pond, tapestry (DHT), broken dreams. | |||
==What is the dream?== | ==What is the dream?== | ||
The dream was to create a persistent storage system that had high availability and was universally accessibly--a global, ubiquitous persistent data storage solution. OceanStore was meant to be utility managed by multiple parties, with no one party having total control/monopoly over the system. | The dream was to create a persistent storage system that had high availability and was universally accessibly--a global, ubiquitous persistent data storage solution. OceanStore was meant to be utility managed by multiple parties, with no one party having total control/monopoly over the system. | ||
The basic assumption made by the designers of OceanStore, however, was that none of the servers could be trusted. It would be built over the open internet. To support this, the system held only opaque/encrypted data. As such, the system could be used for more than files (e.g., for whole databases). | |||
The basic assumption | The second basic assumption was that the system utilized nomadic data. Information was divorced from a physical location. Information was stored and replicated everywhere. It used promiscuous caching to cache information near its users. This is unlike with NFS and AFS where only specific servers cache the data. | ||
To support the goal of high availability, there was a high amount of redundancy and fault-tolerance. For high persistence, everything was archived--nothing was ever truly deleted. This can be likened to working in version control with "Commits". This is possibly due to the realization that the easier it is to delete things, the easier it is to lose things. | |||
==Why did the dream die?== | ==Why did the dream die?== | ||
The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild ''everything'' to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted). Other successful distributed systems are built on a more trusted model. In short, the solution that accommodates untrusted actors assumption is just too expensive. | The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild ''everything'' to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted). | ||
Other successful distributed systems are built on a more trusted model. Every node in Dynamo, BigTable, etc. is trusted. In short, the solution that accommodates untrusted actors assumption is just too expensive. | |||
=== Technology === | === Technology === | ||
Line 20: | Line 31: | ||
===Use Cases=== | ===Use Cases=== | ||
A subset of the features outlined for OceanStore already exist. For example, Blackberry and Google offer similar services. These current services are owned by one company, however, not many providers. You can also not sell back your services as a user (e.g., you can't sell your extra storage back to the utility). | A subset of the features outlined for OceanStore already exist. For example, Blackberry and Google offer similar groupware services (eg. email, contact lists, etc.) These current services are owned by one company, however, not many providers. You can also not sell back your services as a user (e.g., you can't sell your extra storage back to the utility). | ||
==Pond: What insights?== | ==Pond: What insights?== | ||
In short: they actually built it! However, due to the untrusted assumption, they can't assume the use of any infrastructure, causing them to rebuild ''everything''! It was built over the internet with Tapestry (routing) and GUID for object identification (object naming scheme). | In short: they actually built it! However, due to the untrusted assumption, they can't assume the use of any infrastructure, causing them to rebuild ''everything''! It was built over the internet with Tapestry (dynamic routing) and GUID for object identification (object naming scheme). | ||
==Benchmarks== | ==Benchmarks== | ||
In short: the system had really good read speed, really bad write speed. Absolutely everything is expensive and there is high latency. | |||
===Storage overhead=== | ===Storage overhead=== | ||
One general question was how much are they increasing the storage needed to implement their storage model? The answer: a factor of 4.8x the space is needed (you'll have 1/5th the storage). While this is expensive, it does have a good value as your data is backed up, replicated, etc. However, it does cause one to consider how important it is to make an update as you burn more storage space as more updates are made. | |||
===Update performance=== | ===Update performance=== | ||
None of the data is mutated--it is diffed and archived (ie. a "commit). You are essentially creating a new version of an object and then distributing that object to all nodes. | |||
=== | ==Other stuff== | ||
* | '''Byzantine fault tolerance''' | ||
* Byzantine fault tolerance has the assumption that there are malicious actors in your system. | |||
* byzantine fault tolerant network replicates the data in such a way that even if m nodes out of total n nodes,in a network,fail, you would still be able to recover the whole data. but as you increase the value of number m, the required network messages to be exchanges also increases, so there is a tradeoff. | |||
* You are assuming certain actors are malicious. | |||
'''Bitcoin''' | |||
* Trusted vs Untrusted. | |||
* It is considered to be untrusted but it takes huge amount of trust when exchanges are made. | |||
'''Bloom Filters''' | |||
Bloom Filter are lossy structures which can answer a membership query -is x in set S? It was invented by Burton Bloom in 1970 and has | |||
been widely used for Web caching.The storage requirement of a BF falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency comes at a cost - Bloom filters may give a false positive result - that is, they can report that a certain element is present in a set, while it is actually not there. however the reverse is not true, that is, if bloom filters report an element non-existent in a set, it will never turn out that, that element actually does belong to the set. | |||
Bloom filters have found their way into distributed file systems for meta data management. They can be used to let a MDS(meta data server) server decide if a particular file's belongs to it or some other MDS. if a hit is found in any of the MDS' bloom filter, the metadata request is forwarded to that particular server, it is highly probable that MDS will actually have that metadata. This approach can achieve few very desirable traits for Meta data management like - | |||
** | * Scalable service. | ||
* | * Zero metadata migration. | ||
* | * Balancing the load of metadata accesses. | ||
* Flexibility of storing the metadata of a file on any MS. | |||
Sadly Bloom filter, which be answer all the metadata management problems can see lot of churn in case of a false hit or a mis-hit and studies have shown that the false hit rate of a BF array actually increases with the filter size. This false hit rate can be improved by allowing more bits/file in filter but then it would cause memory issues when number of files is in millions. | |||
==What's worth salvaging from the dream?== | ==What's worth salvaging from the dream?== | ||
Some of the good things that we can salvage are using spare resources in other locations. It can also be noted that similar routing systems are used in large peer to peer systems. | |||
==How to read a research paper== | ==How to read a research paper== | ||
# Start with the Introduction to figure out what the problem is. | |||
# See/read through the related work/background for context of the paper. | |||
# Go to the conclusion and focus on the results (i.e., figure out what they actually did). | |||
# Fill in the gaps by reading specific parts of the body. | |||
Latest revision as of 13:23, 24 April 2014
OceanStore
- John Kubiatowicz et al., "OceanStore: An Architecture for Global-Scale Persistent Storage" (2000)
- Sean Rhea et al., "Pond: the OceanStore Prototype" (2003)
- Project Overview
Keywords
Highly available, universally available, utility business model, untrusted servers, nomadic data, promiscuous caching, immutable version-based archival storage, highly persistent, pond, tapestry (DHT), broken dreams.
What is the dream?
The dream was to create a persistent storage system that had high availability and was universally accessibly--a global, ubiquitous persistent data storage solution. OceanStore was meant to be utility managed by multiple parties, with no one party having total control/monopoly over the system.
The basic assumption made by the designers of OceanStore, however, was that none of the servers could be trusted. It would be built over the open internet. To support this, the system held only opaque/encrypted data. As such, the system could be used for more than files (e.g., for whole databases).
The second basic assumption was that the system utilized nomadic data. Information was divorced from a physical location. Information was stored and replicated everywhere. It used promiscuous caching to cache information near its users. This is unlike with NFS and AFS where only specific servers cache the data.
To support the goal of high availability, there was a high amount of redundancy and fault-tolerance. For high persistence, everything was archived--nothing was ever truly deleted. This can be likened to working in version control with "Commits". This is possibly due to the realization that the easier it is to delete things, the easier it is to lose things.
Why did the dream die?
The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild everything to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted).
Other successful distributed systems are built on a more trusted model. Every node in Dynamo, BigTable, etc. is trusted. In short, the solution that accommodates untrusted actors assumption is just too expensive.
Technology
As outlined above, the trust model (read: fundamentally untrusted model) is the most attractive feature which ultimately killed it. The untrusted assumption introduced a huge burden on the system, forcing technical limitations which made OceanStore uncompetitive in comparison to other solutions. It is just much more easy and convenient to trust a given system. It should be noted that every system is compromisable, despite this mistrust.
The public key system also reduces usability--if a user loses their key, they are completely out of luck and would need to acquire a new key. This also means that, if you wanted to remove their access over an object, you would have to re-encrypt the object with a new key and provide that key to said user, who would then have access to the object.
With regards to the security, there is no security mechanism on the server side. The server can not know who is accessing the data. On the economic side, the economic model is unconvincing with the way it is defined. The authors suggest that a collection of companies will host OceanStore servers and consumers will buy capacity (not unlike web-hosting today).
Use Cases
A subset of the features outlined for OceanStore already exist. For example, Blackberry and Google offer similar groupware services (eg. email, contact lists, etc.) These current services are owned by one company, however, not many providers. You can also not sell back your services as a user (e.g., you can't sell your extra storage back to the utility).
Pond: What insights?
In short: they actually built it! However, due to the untrusted assumption, they can't assume the use of any infrastructure, causing them to rebuild everything! It was built over the internet with Tapestry (dynamic routing) and GUID for object identification (object naming scheme).
Benchmarks
In short: the system had really good read speed, really bad write speed. Absolutely everything is expensive and there is high latency.
Storage overhead
One general question was how much are they increasing the storage needed to implement their storage model? The answer: a factor of 4.8x the space is needed (you'll have 1/5th the storage). While this is expensive, it does have a good value as your data is backed up, replicated, etc. However, it does cause one to consider how important it is to make an update as you burn more storage space as more updates are made.
Update performance
None of the data is mutated--it is diffed and archived (ie. a "commit). You are essentially creating a new version of an object and then distributing that object to all nodes.
Other stuff
Byzantine fault tolerance
- Byzantine fault tolerance has the assumption that there are malicious actors in your system.
- byzantine fault tolerant network replicates the data in such a way that even if m nodes out of total n nodes,in a network,fail, you would still be able to recover the whole data. but as you increase the value of number m, the required network messages to be exchanges also increases, so there is a tradeoff.
- You are assuming certain actors are malicious.
Bitcoin
- Trusted vs Untrusted.
- It is considered to be untrusted but it takes huge amount of trust when exchanges are made.
Bloom Filters
Bloom Filter are lossy structures which can answer a membership query -is x in set S? It was invented by Burton Bloom in 1970 and has been widely used for Web caching.The storage requirement of a BF falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency comes at a cost - Bloom filters may give a false positive result - that is, they can report that a certain element is present in a set, while it is actually not there. however the reverse is not true, that is, if bloom filters report an element non-existent in a set, it will never turn out that, that element actually does belong to the set. Bloom filters have found their way into distributed file systems for meta data management. They can be used to let a MDS(meta data server) server decide if a particular file's belongs to it or some other MDS. if a hit is found in any of the MDS' bloom filter, the metadata request is forwarded to that particular server, it is highly probable that MDS will actually have that metadata. This approach can achieve few very desirable traits for Meta data management like -
- Scalable service.
- Zero metadata migration.
- Balancing the load of metadata accesses.
- Flexibility of storing the metadata of a file on any MS.
Sadly Bloom filter, which be answer all the metadata management problems can see lot of churn in case of a false hit or a mis-hit and studies have shown that the false hit rate of a BF array actually increases with the filter size. This false hit rate can be improved by allowing more bits/file in filter but then it would cause memory issues when number of files is in millions.
What's worth salvaging from the dream?
Some of the good things that we can salvage are using spare resources in other locations. It can also be noted that similar routing systems are used in large peer to peer systems.
How to read a research paper
- Start with the Introduction to figure out what the problem is.
- See/read through the related work/background for context of the paper.
- Go to the conclusion and focus on the results (i.e., figure out what they actually did).
- Fill in the gaps by reading specific parts of the body.