DistOS 2014W Lecture 14: Difference between revisions

From Soma-notes
Boliu (talk | contribs)
 
(18 intermediate revisions by 4 users not shown)
Line 1: Line 1:
=OceanStore=
=OceanStore=
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/oceanstore-sigplan.pdf John Kubiatowicz et al., "OceanStore: An Architecture for Global-Scale Persistent Storage" (2000)]
* [http://homeostasis.scs.carleton.ca/~soma/distos/fall2008/fast2003-pond.pdf Sean Rhea et al., "Pond: the OceanStore Prototype" (2003)]
* [http://oceanstore.cs.berkeley.edu/info/overview.html Project Overview]
==Keywords==
Highly available, universally available, utility business model, untrusted servers, nomadic data, promiscuous caching, immutable version-based archival storage, highly persistent, pond, tapestry (DHT), broken dreams.


==What is the dream?==
==What is the dream?==
* High availabitility, universally accessible.
The dream was to create a persistent storage system that had high availability and was universally accessibly--a global, ubiquitous persistent data storage solution. OceanStore was meant to be utility managed by multiple parties, with no one party having total control/monopoly over the system.  
* Utility managed by multiple parties.
* Highly redundant, fault tolerant
* Basic assumption was that servers would NOT be trusted.


* Highly persistent
The basic assumption made by the designers of OceanStore, however, was that none of the servers could be trusted. It would be built over the open internet. To support this, the system held only opaque/encrypted data. As such, the system could be used for more than files (e.g., for whole databases).  
** Everything archived
** Everything saved, nothing deleted. "Commits"


* Service was untrusted
The second basic assumption was that the system utilized nomadic data. Information was divorced from a physical location. Information was stored and replicated everywhere. It used promiscuous caching to cache information near its users. This is unlike with NFS and AFS where only specific servers cache the data.
** Held opaque/encrypted data.
* Would have been used for more than files. (eg. DB's, etc.)


*Global ubiquitous persistent data storage
To support the goal of high availability, there was a high amount of redundancy and fault-tolerance. For high persistence, everything was archived--nothing was ever truly deleted. This can be likened to working in version control with "Commits". This is possibly due to the realization that the easier it is to delete things, the easier it is to lose things.
*Nomadic data
*Untrusted infrastructure
*Cannot delete data, universal archive
**The easier you delete stuff, the easier you lose stuff


==Why did the dream die?==
==Why did the dream die?==


* Biggest reason it died was it's assumption of mistrusting the actors.
The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild ''everything'' to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted).  
** Everything else they did was right.
 
* Other successful distributed systems are built on a more trusted model.
Other successful distributed systems are built on a more trusted model. Every node in Dynamo, BigTable, etc. is trusted. In short, the solution that accommodates untrusted actors assumption is just too expensive.


=== Technology ===
=== Technology ===
* The trust model is the most attractive feature which ultimately killed it.
As outlined above, the trust model (read: fundamentally untrusted model) is the most attractive feature which ultimately killed it. The untrusted assumption introduced a huge burden on the system, forcing technical limitations which made OceanStore uncompetitive in comparison to other solutions. It is just much more easy and convenient to trust a given system. It should be noted that every system is compromisable, despite this mistrust.
** The untrusted assumption was a huge burden on the system. Forced technical limitations made them uncompetitive.
 
** It is just easier to trust a given system. More convenient.
The public key system also reduces usability--if a user loses their key, they are completely out of luck and would need to acquire a new key. This also means that, if you wanted to remove their access over an object, you would have to re-encrypt the object with a new key and provide that key to said user, who would then have access to the object.
** Every system is compromisable despite this mistrust
 
* Pub key system reduces usability
With regards to the security, there is no security mechanism on the server side. The server can not know who is accessing the data. On the economic side, the economic model is unconvincing with the way it is defined. The authors suggest that a collection of companies will host OceanStore servers and consumers will buy capacity (not unlike web-hosting today).
** If you loose your key, you're S.O.L.
*security
**there is no security mechanism in servers side.
**can not now who access the data
*economic side
**The economic model is unconvincing as defined. The authors suggest that a collection of companies will host OceanStore servers, and consumers will buy capacity (not unlike web-hosting of today).


===Use Cases===
===Use Cases===
* Subset of the features already exist
A subset of the features outlined for OceanStore already exist. For example, Blackberry and Google offer similar groupware services (eg. email, contact lists, etc.) These current services are owned by one company, however, not many providers. You can also not sell back your services as a user (e.g., you can't sell your extra storage back to the utility).
** Blackberry and Google offer similar services.
** These current services owned by one company, not many providers.
** Can not sell back your services as a user.
*** ex. Can not sell your extra storage back to the utility.


==Pond: What insights?==
==Pond: What insights?==
 
In short: they actually built it! However, due to the untrusted assumption, they can't assume the use of any infrastructure, causing them to rebuild ''everything''! It was built over the internet with Tapestry (dynamic routing) and GUID for object identification (object naming scheme).
* They actually built it.
* Can't assume the use of any infrastructure, so they rebuild everything!
** Built over the internet.
** Tapestry (routing).
** GUID for object indentification. Object naming scheme.


==Benchmarks==
==Benchmarks==
* Really good read speed, really bad write speed.
In short: the system had really good read speed, really bad write speed. Absolutely everything is expensive and there is high latency.


===Storage overhead===
===Storage overhead===
* How much are they increasing the storage needed to implement their storage model.
One general question was how much are they increasing the storage needed to implement their storage model? The answer: a factor of 4.8x the space is needed (you'll have 1/5th the storage). While this is expensive, it does have a good value as your data is backed up, replicated, etc. However, it does cause one to consider how important it is to make an update as you burn more storage space as more updates are made.  
* Factor of 4.8x the space needed (you'll have 1/5th the storage)
* Expensive, but good value (data is backed up, replicated, etc..)


===Update performance===
===Update performance===
* No data is mutated. It is diffed and archived.
None of the data is mutated--it is diffed and archived (ie. a "commit). You are essentially creating a new version of an object and then distributing that object to all nodes.
* Creating a new version of an object and distributing that object.
 
==Other stuff==
'''Byzantine fault tolerance'''
* Byzantine fault tolerance has the assumption that there are malicious actors in your system.
* byzantine fault tolerant network replicates the data in such a way that even if m nodes out of total n nodes,in a network,fail, you would still be able to recover the whole data. but as you increase the value of number m, the required network messages to be exchanges also increases, so there is a tradeoff.
* You are assuming certain actors are malicious.
'''Bitcoin'''
* Trusted vs Untrusted.
* It is considered to be untrusted but it takes huge amount of trust when exchanges are made.
'''Bloom Filters'''


===Benchmarks in a nutshell===
Bloom Filter are lossy structures which can answer a membership query -is x in set S? It was invented by Burton Bloom in 1970  and has
* Everything is expensive!
been widely used for Web caching.The storage requirement of a BF falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency comes at a cost - Bloom filters may give a false positive result - that is, they can report that a certain element is present in a set, while it is actually not there. however the reverse is not true, that is, if bloom filters report an element non-existent in a set, it will never turn out that, that element actually does belong to the set.
* High latency
Bloom filters have found their way into distributed file systems for meta data management. They can be used to let a MDS(meta data server) server decide if a particular file's belongs to it or some other MDS. if a hit is found in any of the MDS' bloom filter, the metadata request is forwarded to that particular server, it is highly probable that MDS will actually have that metadata. This approach can achieve few very desirable traits for Meta data management like -
* Scalable service.
* Zero metadata migration.
* Balancing the load of metadata accesses.
* Flexibility of storing the metadata of a file on any MS.


==Other stuff==
Sadly Bloom filter, which be answer all the metadata management problems can see lot of churn  in case of a false hit or a mis-hit and studies have shown that the false hit rate of a BF array actually increases with the filter size. This false hit rate can be improved by allowing more bits/file in filter but then it would cause memory issues when number of files is in millions.
* Byzantine fault tolerance
** Assuming certain actors are malicious


==What's worth salvaging from the dream?==
==What's worth salvaging from the dream?==
* Using spare resources in other locations.
Some of the good things that we can salvage are using spare resources in other locations. It can also be noted that similar routing systems are used in large peer to peer systems.


==How to read a research paper==
==How to read a research paper==
* Start with Intro
# Start with the Introduction to figure out what the problem is.
** Figure out what the problem is
# See/read through the related work/background for context of the paper.
* then see the related work for context
# Go to the conclusion and focus on the results (i.e., figure out what they actually did).
* then go to conclusion. Focus on results.
# Fill in the gaps by reading specific parts of the body.
* then fill in the gaps by reading specific parts of the body

Latest revision as of 13:23, 24 April 2014

OceanStore

Keywords

Highly available, universally available, utility business model, untrusted servers, nomadic data, promiscuous caching, immutable version-based archival storage, highly persistent, pond, tapestry (DHT), broken dreams.

What is the dream?

The dream was to create a persistent storage system that had high availability and was universally accessibly--a global, ubiquitous persistent data storage solution. OceanStore was meant to be utility managed by multiple parties, with no one party having total control/monopoly over the system.

The basic assumption made by the designers of OceanStore, however, was that none of the servers could be trusted. It would be built over the open internet. To support this, the system held only opaque/encrypted data. As such, the system could be used for more than files (e.g., for whole databases).

The second basic assumption was that the system utilized nomadic data. Information was divorced from a physical location. Information was stored and replicated everywhere. It used promiscuous caching to cache information near its users. This is unlike with NFS and AFS where only specific servers cache the data.

To support the goal of high availability, there was a high amount of redundancy and fault-tolerance. For high persistence, everything was archived--nothing was ever truly deleted. This can be likened to working in version control with "Commits". This is possibly due to the realization that the easier it is to delete things, the easier it is to lose things.

Why did the dream die?

The biggest reason that caused the OceanStore dream to die was the assumption of mistrusting all the actors--everything else they did was right. This assumption, however, caused the system to become needlessly complicated as they had to rebuild everything to accommodate this assumption. This was also unrealistic as this is not an assumption that is generally made (i.e., it is normally assumed that at least some of the actors can be trusted).

Other successful distributed systems are built on a more trusted model. Every node in Dynamo, BigTable, etc. is trusted. In short, the solution that accommodates untrusted actors assumption is just too expensive.

Technology

As outlined above, the trust model (read: fundamentally untrusted model) is the most attractive feature which ultimately killed it. The untrusted assumption introduced a huge burden on the system, forcing technical limitations which made OceanStore uncompetitive in comparison to other solutions. It is just much more easy and convenient to trust a given system. It should be noted that every system is compromisable, despite this mistrust.

The public key system also reduces usability--if a user loses their key, they are completely out of luck and would need to acquire a new key. This also means that, if you wanted to remove their access over an object, you would have to re-encrypt the object with a new key and provide that key to said user, who would then have access to the object.

With regards to the security, there is no security mechanism on the server side. The server can not know who is accessing the data. On the economic side, the economic model is unconvincing with the way it is defined. The authors suggest that a collection of companies will host OceanStore servers and consumers will buy capacity (not unlike web-hosting today).

Use Cases

A subset of the features outlined for OceanStore already exist. For example, Blackberry and Google offer similar groupware services (eg. email, contact lists, etc.) These current services are owned by one company, however, not many providers. You can also not sell back your services as a user (e.g., you can't sell your extra storage back to the utility).

Pond: What insights?

In short: they actually built it! However, due to the untrusted assumption, they can't assume the use of any infrastructure, causing them to rebuild everything! It was built over the internet with Tapestry (dynamic routing) and GUID for object identification (object naming scheme).

Benchmarks

In short: the system had really good read speed, really bad write speed. Absolutely everything is expensive and there is high latency.

Storage overhead

One general question was how much are they increasing the storage needed to implement their storage model? The answer: a factor of 4.8x the space is needed (you'll have 1/5th the storage). While this is expensive, it does have a good value as your data is backed up, replicated, etc. However, it does cause one to consider how important it is to make an update as you burn more storage space as more updates are made.

Update performance

None of the data is mutated--it is diffed and archived (ie. a "commit). You are essentially creating a new version of an object and then distributing that object to all nodes.

Other stuff

Byzantine fault tolerance

  • Byzantine fault tolerance has the assumption that there are malicious actors in your system.
  • byzantine fault tolerant network replicates the data in such a way that even if m nodes out of total n nodes,in a network,fail, you would still be able to recover the whole data. but as you increase the value of number m, the required network messages to be exchanges also increases, so there is a tradeoff.
  • You are assuming certain actors are malicious.

Bitcoin

  • Trusted vs Untrusted.
  • It is considered to be untrusted but it takes huge amount of trust when exchanges are made.

Bloom Filters

Bloom Filter are lossy structures which can answer a membership query -is x in set S? It was invented by Burton Bloom in 1970 and has been widely used for Web caching.The storage requirement of a BF falls several orders of magnitude below the lower bounds of error-free encoding structures. This space efficiency comes at a cost - Bloom filters may give a false positive result - that is, they can report that a certain element is present in a set, while it is actually not there. however the reverse is not true, that is, if bloom filters report an element non-existent in a set, it will never turn out that, that element actually does belong to the set. Bloom filters have found their way into distributed file systems for meta data management. They can be used to let a MDS(meta data server) server decide if a particular file's belongs to it or some other MDS. if a hit is found in any of the MDS' bloom filter, the metadata request is forwarded to that particular server, it is highly probable that MDS will actually have that metadata. This approach can achieve few very desirable traits for Meta data management like -

  • Scalable service.
  • Zero metadata migration.
  • Balancing the load of metadata accesses.
  • Flexibility of storing the metadata of a file on any MS.

Sadly Bloom filter, which be answer all the metadata management problems can see lot of churn in case of a false hit or a mis-hit and studies have shown that the false hit rate of a BF array actually increases with the filter size. This false hit rate can be improved by allowing more bits/file in filter but then it would cause memory issues when number of files is in millions.

What's worth salvaging from the dream?

Some of the good things that we can salvage are using spare resources in other locations. It can also be noted that similar routing systems are used in large peer to peer systems.

How to read a research paper

  1. Start with the Introduction to figure out what the problem is.
  2. See/read through the related work/background for context of the paper.
  3. Go to the conclusion and focus on the results (i.e., figure out what they actually did).
  4. Fill in the gaps by reading specific parts of the body.