DistOS 2014W Lecture 19

== Dynamo ==


* Key-value store.
* Query model: key-value only
* Highly available, always writable.
* Guarantees Service Level Agreements (SLAs).
* 0-hop DHT: each node has a direct link to the destination and a complete view of the system locally. No dynamic routing.
* Dynamo sacrifices consistency under certain failure scenarios.
* Consistent hashing to partition key-space: the output range of a hash function is treated as a fixed circular space or “ring”.
* Key-space is linear and the nodes partition it.
* "Virtual nodes": each server can be responsible for more than one virtual node.
* Each data item is replicated at N hosts.
* "Preference list": the list of nodes responsible for storing a particular key.
* Sacrifice strong consistency for availability.
** Eventual consistency.
* Decentralized, P2P, limited administration.
* Works at the scale of about a hundred servers, not thousands.
* Application/client specific conflict resolution.
* Designed to be flexible
** "Tuneable consistency"
** Pluggable local persistence: BDB (Berkeley DB), MySQL.
Amazon's motivating use case is that a newly added item should never be dropped from a customer's shopping cart. Dynamo should be highly available and always writeable.
Amazon has a service-oriented architecture. A response to a client is a composite of many services, so SLAs were a huge consideration when designing Dynamo. Amazon needed low latency and high availability to ensure a good user experience when aggregating all the services together.
Traditional RDBMSs emphasise ACID compliance. Amazon found that ACID compliance led to systems with far less availability; it is hard to have both consistency and availability at the same time. See the [http://en.wikipedia.org/wiki/CAP_theorem CAP Theorem]. Dynamo can, and usually does, sacrifice consistency for availability. They use the terms "eventual consistency" and "tunable consistency".
The key range is partitioned according to a consistent hashing algorithm, which treats the output range of the hash function as a fixed circular space or "ring". Any time a new node joins, it takes a token that decides its position on the ring. Every node becomes the owner of the key range between itself and the previous node on the ring, so a node joining or leaving only affects its neighbouring nodes. Dynamo also has a notion of virtual nodes, where one machine can host more than one node on the ring, which allows the load to be adjusted to the machine's capability.
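
A minimal sketch of consistent hashing with virtual nodes, for illustration only (the class, hash choice, and token count are assumptions for the example, not Dynamo's actual implementation):

<pre>
import bisect
import hashlib

def hash_key(key):
    # Map any string onto the fixed circular key space (the "ring").
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, tokens_per_server=3):
        self.tokens_per_server = tokens_per_server
        self.tokens = []          # sorted token positions on the ring
        self.token_owner = {}     # token -> physical server name

    def add_server(self, server):
        # Each physical server takes several tokens ("virtual nodes"),
        # so the load can be tuned to the machine's capability.
        for i in range(self.tokens_per_server):
            token = hash_key(f"{server}#{i}")
            bisect.insort(self.tokens, token)
            self.token_owner[token] = server

    def owner(self, key):
        # A key is owned by the first token clockwise from its position,
        # so a node joining or leaving only affects neighbouring ranges.
        h = hash_key(key)
        idx = bisect.bisect_left(self.tokens, h) % len(self.tokens)
        return self.token_owner[self.tokens[idx]]

ring = Ring()
for s in ["serverA", "serverB", "serverC"]:
    ring.add_server(s)
print(ring.owner("cart:alice"))   # one of the three servers
</pre>
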
Dynamo uses replication to provide availability: each key-value pair is stored at N nodes, the node that owns the key plus its successors on the ring (N can be configured by the application that uses Dynamo).
Each node has a complete view of the network: a node knows the key range that every other node supports. Any time a node joins, gossip-based protocols are used to inform every node about the key-range changes. This allows Dynamo to be a 0-hop network. 0-hop means it is logically a zero-hop network; IP routing is still required to actually, physically reach the node. This 0-hop approach is different from typical distributed hash tables, where routing and multiple hops are used to find the node responsible for a key (e.g. Tapestry). Dynamo can do this because the system is deployed on trusted, fully known networks.
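
Because membership is fully known, a node can compute the preference list (the N replicas of a key) locally and contact those servers directly. A hedged sketch building on the Ring class above; send_over_ip is a hypothetical stand-in for the RPC layer:

<pre>
def send_over_ip(server, op, key, value=None):
    # Stand-in for a direct request to a replica; no overlay routing involved.
    print(f"{op} {key} -> {server}")

class ZeroHopDynamo(Ring):
    def preference_list(self, key, n=3):
        # Walk clockwise from the key's position and collect the first n
        # distinct physical servers: these replicas store the key.
        h = hash_key(key)
        start = bisect.bisect_left(self.tokens, h)
        servers = []
        for i in range(len(self.tokens)):
            owner = self.token_owner[self.tokens[(start + i) % len(self.tokens)]]
            if owner not in servers:
                servers.append(owner)
            if len(servers) == n:
                break
        return servers

    def put(self, key, value):
        # Every node knows the whole membership, so a write goes straight
        # to each replica: logically zero hops (IP routing still applies).
        for server in self.preference_list(key):
            send_over_ip(server, "put", key, value)
</pre>
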


Dynamo is deployed on trusted networks (i.e. for Amazon's internal applications), so it does not have to worry about making the system secure. Compare this to OceanStore.


When compared to BigTable, Dynamo typically scales to hundreds of servers, not thousands. That is not to say that Dynamo cannot scale; we need to understand the difference between the use cases for BigTable and Dynamo.


Any "write" that is done on any replica is never held off to serialize the updates to maintain consistency, it will eventually try to reconcile the difference between two different versions( based on the logs) if it can not do so, the conflict resolution is left to the client application which would read data from Dynamo(If there are more than versions of a replica, all the versions along with the log is passed to client and client should reconcile the changes)
Any "write" that is done on any replica is never held off to serialize the updates to maintain consistency, it will eventually try to reconcile the difference between two different versions( based on the logs) if it can not do so, the conflict resolution is left to the client application which would read data from Dynamo(If there are more than versions of a replica, all the versions along with the log is passed to client and client should reconcile the changes)


== Bigtable ==


* Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
* Designed to scale to a very large size
* More focused on consistency than Dynamo.
* Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
* A BigTable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes (see the data-model sketch after this list).
* Column oriented DB.
** Streaming chunks of columns is easier than streaming entire rows.
* Data Model: rows made up of column families. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.
* The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.
** Eg. Row: the page URL. Column families would either be the content, or the set of inbound links.
** Each cell in a column family can hold multiple copies of the data, distinguished by timestamp.
* Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
* Bigtable uses the distributed Google File System (GFS) to store log and data files.
* Tablets: large tables are broken into tablets at row boundaries, and each tablet holds a contiguous range of sorted rows.
** Immutable b/c of GFS. Deletion happens via garbage collection.


* An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.
* Metadata operations: create/delete tables and column families, change metadata.
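
An illustrative toy model of the data model described above: a sorted map indexed by (row key, column, timestamp), with values kept as uninterpreted bytes. The class and the webtable-style rows are assumptions for the example, not Bigtable's implementation:

<pre>
import time

class ToyBigtable:
    def __init__(self):
        # (row key, "family:qualifier") -> list of (timestamp, value);
        # values are uninterpreted bytes.
        self.cells = {}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        # Return the most recent timestamped version of the cell.
        versions = self.cells.get((row, column), [])
        return max(versions) if versions else None

    def scan_rows(self, prefix=""):
        # Rows are kept in lexicographic order, so a prefix scan reads a
        # contiguous range (e.g. all pages under one reversed domain).
        for (row, column) in sorted(self.cells):
            if row.startswith(prefix):
                yield row, column, self.get(row, column)

t = ToyBigtable()
t.put("com.cnn.www", "contents:", b"page contents")
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
for cell in t.scan_rows("com.cnn"):
    print(cell)
</pre>
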


=== Implementation ===
* Centralized, hierarchical.
* Three major components: client library, one master server, many tablet servers.
* Master server
** Assigns tablets to tablet servers.
** Detects the addition and removal of tablet servers.
** Garbage collection of files in GFS.
** Balances tablet-server load.
** Handles schema changes such as table and column family creations.
* Tablet Servers
** Holds tablet locations.
** Manages multiple tablets (thousands per tablet server)
** Handles I/O.
* Client Library
** The interface that developers use.
** Caches tablet locations (see the sketch below).
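
A rough sketch of why the client library caches tablet locations: after the first lookup, reads go straight to the tablet server and never through the master. The names are hypothetical, the lookup is an injected function (in reality it walks Chubby and the METADATA table), and the cache is keyed per row here rather than per row range for brevity:

<pre>
class ClientLibrary:
    def __init__(self, lookup_tablet_location):
        # lookup_tablet_location(table, row) resolves which tablet server
        # serves the row; modelled here as an injected function.
        self._lookup = lookup_tablet_location
        self._location_cache = {}   # (table, row) -> tablet server address

    def read(self, table, row):
        server = self._location_cache.get((table, row))
        if server is None:
            # Cache miss: resolve the location once, then cache it so later
            # reads bypass the location hierarchy entirely.
            server = self._lookup(table, row)
            self._location_cache[(table, row)] = server
        return f"read {table}/{row} from {server}"   # stand-in for the RPC

client = ClientLibrary(lambda table, row: "tabletserver-17")
print(client.read("webtable", "com.cnn.www"))
</pre>
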
=== Tablet Serving ===
* The persistent state of a tablet is stored in GFS
* Updates are committed to a commit log that stores redo records.
* The most recently committed updates are stored in memory in a sorted buffer called a memtable.
* The older updates are stored in a sequence of SSTables.
* To recover a tablet, a tablet server reads its metadata from the METADATA table.
* This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet.
* The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.
* When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS (see the sketch below).
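
A simplified in-memory sketch of the write path and recovery steps listed above; in the real system the commit log and SSTables are persisted in GFS and the memtable threshold is far larger than three entries:

<pre>
class Tablet:
    def __init__(self, memtable_limit=3):
        self.commit_log = []    # redo records (persisted to GFS in reality)
        self.memtable = {}      # recently committed writes, held in memory
        self.sstables = []      # older updates: immutable, sorted SSTables
        self.redo_point = 0     # log index already captured in SSTables
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Append a redo record to the commit log, then apply to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush_memtable()

    def _flush_memtable(self):
        # Freeze the memtable, write it out as a sorted immutable SSTable,
        # and advance the redo point so older log entries can be ignored.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.redo_point = len(self.commit_log)

    def recover(self):
        # Rebuild the memtable by replaying commit-log entries written after
        # the redo point; the SSTables already hold everything older.
        self.memtable = {}
        for key, value in self.commit_log[self.redo_point:]:
            self.memtable[key] = value

    def read(self, key):
        # Newest data first: memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None
</pre>
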
=== Caching ===
To improve read performance, tablet servers use two levels of caching.
* The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Scan Cache is most useful for applications that tend to read the same data repeatedly.
* The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (see the sketch below).
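
A toy sketch of how a read falls through the two cache levels; the block layout and the GFS read are hypothetical stand-ins, not Bigtable's interfaces:

<pre>
FAKE_GFS_BLOCKS = {0: {"rowA": b"valA"}, 1: {"rowB": b"valB"}}   # stand-in data

def read_block_from_gfs(block_no):
    print(f"disk read of block {block_no}")   # the expensive step to avoid
    return FAKE_GFS_BLOCKS[block_no]

def block_for(key):
    return 0 if key < "rowB" else 1           # stand-in for the SSTable index

class TwoLevelCache:
    def __init__(self):
        self.scan_cache = {}    # key -> decoded value (key-value pairs)
        self.block_cache = {}   # block number -> raw SSTable block from GFS

    def read(self, key):
        # Scan Cache hit: a repeated read of the same key costs nothing.
        if key in self.scan_cache:
            return self.scan_cache[key]
        # Block Cache hit: a key near recently read data shares its block.
        block_no = block_for(key)
        if block_no not in self.block_cache:
            self.block_cache[block_no] = read_block_from_gfs(block_no)
        value = self.block_cache[block_no][key]
        self.scan_cache[key] = value
        return value

cache = TwoLevelCache()
cache.read("rowA")
cache.read("rowA")   # served from the Scan Cache: no block or disk access
</pre>
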
Apart from that, Bigtable relies on a highly available and persistent distributed lock service called Chubby. Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; to store Bigtable schema information; and to store access control lists.
To reduce the number of disk accesses, Bigtable allows clients to specify that Bloom filters be created for the SSTables of a tablet. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet-server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations.
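
A small illustrative Bloom filter showing why a negative answer saves a disk seek: if might_contain returns False, the SSTable definitely holds no data for that row/column pair. The sizes and hashing are arbitrary choices for the example:

<pre>
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent", True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per SSTable, keyed on row/column pairs.
bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))   # True
print(bf.might_contain("com.cnn.www/anchor:x"))    # almost certainly False
</pre>
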
=== Consider the following ===
Can BigTable be used in a shopping-cart type of scenario, where low latency and availability are the main focus? Can it be used like Dynamo? Yes, it can, but not as well. BigTable would have more latency because it was designed for data processing, not for such a scenario; Dynamo was designed for different use cases. There is no one solution that can solve all the problems in the world of distributed file systems; there is no silver bullet, no one-size-fits-all. File systems are usually designed for specific use cases and work best for them. If need be, they can later be molded to work in other scenarios as well, and they may provide good enough performance for those later-added goals, but they will work best for the use cases that were their targets from the beginning.


* BigTable -> Highly consistent, data processing, MapReduce, semi-structured store
* Dynamo -> High availability, low latency, key-value store


== General talk ==

* Read the introduction and conclusion of each paper, and think about the cases discussed in the paper more than about how the authors solve the problem.