DistOS 2015W Session 11

From Soma-notes

BigTable

  • Google System used for storing data of various Google Products, for instance Google Analytics, Google Finance, Orkut, Personalized Search, Writely, Google Earth and many more
  • Big table is
    • Sparse
    • Persistant
    • Muti dimensional Sorted Map
  • It is indexed by
    • Row Key: Every read or write of data under single row key is atomic. Each row range is called Tablet. Select Row key to get good locality for data access.
    • Column Key: Grouped into sets called Column Families. Forms basic unit of Access Control.All data stored is of same type.Syntax used: family:qualifier
    • Time Stamp:Each cell consists of multiple versions of same data which are indexed by Timestamps.In order to avoid collisions, Timestamps need to be generated by applications.
  • Big Table API: Provides functions for
    • Creating and Deleting
      • Tables
      • Column Families
    • Changing Cluster
    • Changing Table
    • Column Family metadata like Access Control Rights.
    • Set of wrappers which allow Big Data to be used both as
      • Input source
      • Output Target
  • The timestamp mechanism in BIG table helps clients to access recent versions of data with simple accessing aspects of using row and column.
  • Parallel computation and cluster management system makes BIG table flexible and highly scalable.

Dynamo

  • Amazon's Key Value Store
  • Availability is the buzz word for Dynamo. Dynamo=Availability
  • Shifted Computer Science paradigm from caring about the consistency to availability.
  • Sacrifices consistency under certain failure scenarios.
  • Treats failure handling as normal case without impact on availability and performance.
  • Data is partitioned and replicated using consistent hashing and consistency is facilitated by use of object versioning.
  • This system has certain requirements such as:
    • Query Model: Simple read and write operations to data item that are uniquely identified by a key.
    • ACID properties: Atomicity, Consistency, Isolation, Durability.
    • Efficiency: System needs to function on a commodity hardware infrastructure.
  • Service Level Agreements(SLA): They are a negotiated contract between a client and a service regarding characteristics related to systems. They are used in order to guarantee that in a bounded time period, an application can deliver it's functionality.
  • System Architecture: It consists of System Interface, Partitioning Algorithm, Replication,Data Versioning.
  • Successfully handles
    • Server Failure
    • Data Centre Failure
    • Network Partitions
  • Allows service owners to customize their own storage systems according to their storage systems to meet the desired performance, durability and consistency SLAs.
  • Building block for highly available applications.

Cassandra

  • Facebook's storage system to fulfil needs of the Inbox Search Problem
  • Partitions data across the cluster using consistent hashing.
  • Distributed multi dimensional map indexed by a key
  • In it's data model:
    • Columns grouped together into sets called column families. Column Families further of 2 types:
      • Simple column families
      • Super column families
  • API consists of :
    • Insert
    • Get
    • Delete
  • System Architecture consists of :
    • Partitioning: Takes place using consistent hashing
    • Replication: Each item replicated at n hosts where "n" is the replication factor configured per system.
    • Membership: Cluster membership is based on Scuttle butt which is a highly efficient anti-entropy Gossip based mechanism.The Membership further has sub part such as:
      • Failure Detection
    • Bootstrapping
    • Scaling the cluster
  • It can run cheap commodity hardware and handle high throughput
  • Its multiple usable structure makes it very scalable

Spanner

  • Google's scalable, multi version, globally distributed database.
  • Has been built on top of the Google's Big table.
  • Provided data consistency and Supports SQL like Interface.
  • Uses a separate high-reliability time service to guarantee the correctness properties around concurrency control.
    • The timestamps are utilized.
  • It shares data across machines and migrates data automatically across machines
  • Data Control Functions in spanner controls latency and performance