DistOS 2014W Lecture 20

From Soma-notes
Revision as of 01:44, 20 April 2014 by Cdelahou (talk | contribs) (Removed line breaks)

Cassandra

Cassandra is essentially running a BigTable interface on top of a Dynamo infrastructure. BigTable uses GFS' built-in replication and Chubby for locking. Cassandra uses gossip algorithms (similar to Dynamo): Scuttlebutt.

A brief look at Open Source

Initially, Anil talked about Google's versus Facebook's approaches to technology.

  • Google developed its technology internally and used it for competitive advantage.
  • Facebook developed its technology in an open source manner; it needed to build an open source community to keep up.
  • He also talked a bit about licences. With GPLv3 you have to provide source code alongside the binary. The AGPL goes further: if you offer the software as a network service, you must also provide the source code.

While discussing HBase versus Cassandra, we asked why Apache, as a community, supports two projects with the same goal. For any tool in CS, particularly software tools, it is actually important to have more than one good implementation; the only time that doesn't happen is because of market realities.

Hadoop is a set of technologies that represents the open source equivalent of Google's infrastructure:

  • Cassandra -> ???
  • HBase -> BigTable
  • HDFS -> GFS
  • Zookeeper -> Chubby

Back to Cassandra

  • Cassandra basically takes a key-value store system like Dynamo and extends it to look like BigTable.
  • It is not just a key-value store: it is a multi-dimensional map. You can look up individual columns, etc. The data is more structured than in a key-value store.
  • In a key-value store, you can only look up by key. Cassandra is much richer than this.
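The multi-dimensional map idea above can be sketched as nested dictionaries. This is an illustration of the data model only, not Cassandra's actual API; the `Table` class and its methods are hypothetical names.

```python
from collections import defaultdict

class Table:
    """A sketch of Cassandra's model: row_key -> column_family -> column -> value."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        # Unlike a plain key-value store, a lookup can address an
        # individual column inside a row, not just a whole opaque value.
        return self.rows[row_key][family][column]

t = Table()
t.put("user42", "profile", "name", "Alice")
t.put("user42", "profile", "email", "alice@example.org")
print(t.get("user42", "profile", "name"))  # -> Alice
```

The key contrast with a key-value store is visible in `get`: the caller names a column inside a row rather than fetching one opaque blob per key.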

Bigtable vs. Cassandra:

  • BigTable and Cassandra expose similar APIs.
  • Cassandra seems to be lighter weight.
  • BigTable depends on GFS; Cassandra depends on each server's local file system. Anil feels a Cassandra cluster is easier to set up.
  • BigTable is designed for stream-oriented batch processing; Cassandra is for handling online/realtime/high-speed workloads.

Schema design is explained via the inbox-search example, but the paper does not make clear exactly how the table looks. Anil thinks they store a lot of data alongside the messages, which makes the table messy.

Apache Zookeeper is used for distributed configuration; it also bootstraps and configures new nodes. It is similar to Chubby. Zookeeper handles node-level information, while the gossip protocol is more about key-partitioning information and distributing that information amongst the nodes.
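The gossip-style dissemination of per-node information can be sketched as version-based state reconciliation, in the spirit of Scuttlebutt: each node keeps versioned entries per peer, and on a gossip exchange it adopts any entry with a higher version. This is a toy illustration, not Cassandra's actual wire protocol.

```python
def merge(local, remote):
    """Merge remote gossip state into local, keeping the newest version.

    Both arguments have the shape {node_id: {key: (value, version)}}.
    """
    for node_id, entries in remote.items():
        mine = local.setdefault(node_id, {})
        for key, (value, version) in entries.items():
            # Higher version number wins; older information is discarded.
            if key not in mine or mine[key][1] < version:
                mine[key] = (value, version)

a = {"n1": {"load": (0.3, 5)}}
b = {"n1": {"load": (0.7, 8)}, "n2": {"load": (0.1, 2)}}
merge(a, b)
print(a["n1"]["load"])  # -> (0.7, 8): the newer version wins
```

Because merging is per-entry and version-ordered, repeated pairwise exchanges eventually spread every node's latest state to every other node.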

Cassandra uses a modified version of the Accrual Failure Detector. The idea of accrual failure detection is that the failure detection module emits a value, φ (phi), which represents a suspicion level for each monitored node. The value of φ is expressed on a scale that is dynamically adjusted to react to network and load conditions at the monitored nodes.
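A simplified sketch of the idea: φ grows the longer a heartbeat is overdue relative to the observed mean inter-arrival time (assuming exponentially distributed inter-arrival times, φ = -log10 P(no heartbeat yet)). Cassandra's real implementation differs in details; the class and window size here are illustrative assumptions.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Toy phi accrual failure detector over a sliding window of intervals."""
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last = None                       # time of last heartbeat

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean),
        # so phi = -log10(exp(-elapsed / mean)) = (elapsed / mean) * log10(e).
        return (elapsed / mean) * math.log10(math.e)

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):      # heartbeats arriving once per second
    d.heartbeat(t)
print(d.phi(3.5))           # low suspicion shortly after a heartbeat
print(d.phi(13))            # suspicion keeps accruing the longer we wait
```

Instead of a binary alive/dead verdict, the application picks a φ threshold that trades detection speed against false positives, and that scale adapts automatically as measured intervals change.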

Files are written to disk in a sequential way and are never mutated. This way, reading a file does not require locks. Garbage collection takes care of deletion.

Cassandra writes in an immutable way, much like functional programming: functional programming has no assignment, which eliminates side effects. Data is simply bound; you associate a name with a value.


Cassandra:

  • Uses consistent hashing (like most DHTs)
  • Lighter weight
  • Almost all of the readings are part of Apache
  • More designed for online, interactive, lower-latency updates
  • Once files are written to disk, they are only read back, never modified
  • Scalable multi-master database with no single point of failure
  • There is a reason for not giving out complete detail on the table schema
  • Probably used for more than just inbox search
  • All of a user's data is in one row of a table
  • It's not a key-value store that maps each key to one big blob of data
  • Gossip-based protocol (Scuttlebutt): every node is aware of every other
  • Fixed circular ring
  • Consistency issues are not really addressed; writes are immutable and never changed afterwards
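The first bullet above, consistent hashing on a fixed circular ring, can be sketched as follows. Nodes and keys hash onto the same ring, and a key is owned by the first node clockwise from its position; adding or removing a node only remaps keys on the adjacent arc. The class and helper names are illustrative, and real systems typically add virtual nodes for balance.

```python
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node sits at a fixed point on the ring, sorted by hash.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key):
        # The key's owner is the first node clockwise from the key's hash,
        # wrapping around to the start of the ring if necessary.
        hashes = [p for p, _ in self.points]
        i = bisect.bisect_right(hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user42"))   # deterministic while membership is fixed
```

Contrast this with `hash(key) % N`: there, changing N remaps almost every key, while on the ring a membership change only moves the keys between the departed/arrived node and its neighbour.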

Token rings are an older style of network protocol. What sort of computational systems avoid changing data? Systems that implement functional-programming-like semantics.

Comet

The major idea behind Comet is triggers/callbacks. There is an extensive literature on extensible operating systems: basically, adding code to the operating system to better suit an application. "Generally, extensible systems suck." -User:Soma

The presentation video of Comet

Comet seeks to greatly expand the application space for key-value storage systems through application-specific customization. A Comet storage object is a <key, value> pair. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behaviour. Handlers are written in a simple sandboxed extension language, providing safety and isolation. An ASO can modify its environment, monitor its execution, and make dynamic decisions about its state.
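The ASO idea can be sketched as a <key, value> pair plus handlers that fire on storage operations. In Comet the handlers run in a sandboxed extension language; here, plain Python callables stand in (with no sandboxing), and the handler names and `ASO` class are illustrative, not Comet's actual interface.

```python
class ASO:
    """Toy active storage object: key, value, and operation handlers."""
    def __init__(self, key, value, handlers=None):
        self.key, self.value = key, value
        self.handlers = handlers or {}
        self.gets = 0                       # per-object state handlers may use

    def get(self):
        if "on_get" in self.handlers:
            self.handlers["on_get"](self)   # application-specific action
        return self.value

    def put(self, value):
        if "on_put" in self.handlers:
            value = self.handlers["on_put"](self, value)  # may transform it
        self.value = value

def count_gets(aso):
    aso.gets += 1                           # e.g. popularity tracking

obj = ASO("song.mp3", b"...", handlers={"on_get": count_gets})
obj.get()
obj.get()
print(obj.gets)  # -> 2
```

The point is that the store itself stays generic; each object carries the small piece of application logic (access counting, expiry, replication policy) it needs.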

The researchers try to provide the ability to extend a DHT without requiring a substantial investment of effort to modify its implementation. They implement isolation and safety by restricting system access, restricting resource consumption, and restricting within-Comet communication.

  • If someone wants to understand consistent hashing in detail, here is a blog post which explains it really well (the blog has other great posts in the field of distributed systems as well): http://loveforprogramming.quora.com/Distributed-Systems-Part-1-A-peek-into-consistent-hashing