Difference between revisions of "DistOS-2011W Cassandra and Hamachi"

From Soma-notes
Jump to navigation Jump to search
Line 13: Line 13:
== The Data Model ==
== The Data Model ==


[[File:Column_Family.PNG|300px|thumb|left|Cassandra Column Family[[#References|<sup>1</sup>]]]]
[[File:Column_Family.PNG|300px|thumb|left|Cassandra Column Family (([[#References|Source #1]]))]]
Cassandra takes a different approach compared to RDBMSs when it comes to how data is conceptualized and managed. In a standard RDBMS, data is ''normalized'' into a series of '''tables''' with a set number of columns. For the most part, the types and number of these columns will only rarely change as the need arises. For a later analogy, we'll refer to this model as a "Narrow Row" model. Cassandra operates on a system of '''''Column Families''''' which are analogous to a '''tables''' in an RDBMS, but is much more flexible. Column Families (hereafter referred to as ''CFs'') are a loose grouping of common data keyed by some value for each row in question and are grouped together into '''Keyspaces'''. However, as opposed to the static feel of a RDBMS table, the number of columns or even the types of columns can and will change under Cassandra. That isn't to say there is no form of schema in Cassandra, because there is a simple type-enforcement schema system in place for each ''CF''. To complete the analogy, the model for Cassandra can be seen as "Wide Row", with the number of columns growing however you please. Different rows in the same CF can even have a different number of columns. Internally, Cassandra uses a "always appending" model when writing data to disk, similar to Google's Bigtable or GFS.
Cassandra takes a different approach compared to RDBMSs when it comes to how data is conceptualized and managed. In a standard RDBMS, data is ''normalized'' into a series of '''tables''' with a set number of columns. For the most part, the types and number of these columns will only rarely change as the need arises. For a later analogy, we'll refer to this model as a "Narrow Row" model. Cassandra operates on a system of '''''Column Families''''' which are analogous to a '''tables''' in an RDBMS, but is much more flexible. Column Families (hereafter referred to as ''CFs'') are a loose grouping of common data keyed by some value for each row in question and are grouped together into '''Keyspaces'''. However, as opposed to the static feel of a RDBMS table, the number of columns or even the types of columns can and will change under Cassandra. That isn't to say there is no form of schema in Cassandra, because there is a simple type-enforcement schema system in place for each ''CF''. To complete the analogy, the model for Cassandra can be seen as "Wide Row", with the number of columns growing however you please. Different rows in the same CF can even have a different number of columns. Internally, Cassandra uses a "always appending" model when writing data to disk, similar to Google's Bigtable or GFS.


Line 49: Line 49:
* Riak
* Riak
* CouchDB
* CouchDB
= Hamachi =
Hamachi is a project aimed at easing the complication of creating and managing a Virtual Private Network. VPNs have historically been used to join external machines onto private networks. A travelling salesman or branch office might use one to gain access to a company's internal resources over possibly insecure lines. With a properly configured VPN encrypted traffic is tunneled over an insecure connection without concern that the contents could be inspected enroute. Hamachi has its origins in the modern playing of classic-era games that don't support WAN based gameplay. By creating a VPN with friends the game would be unable to tell that the machines were not all physically together.
== Caveat ==
While the original implementation of Hamachi was open source the project is currently part of a commercial vendor called LogMeIn. The reason I chose to use this project was simply to fulfill a need for a quick VPN setup with a minimal amount of fuss. The performance and security of my implementation project would undoubtably be improved by an open source and mediator-free VPN solution such as [[http://openvpn.net/|OpenVPN]].


=References=
=References=
# [http://oreilly.com/catalog/0636920010852| Cassandra: The definitive guide] by Eben Hewitt &copy; 2011
# [http://oreilly.com/catalog/0636920010852| Cassandra: The definitive guide] by Eben Hewitt &copy; 2011

Revision as of 23:48, 28 February 2011

In the beginning

The Internet has seen remarkable growth over the last few years, both technologically and socially. The demand for real-time information has increased at exponential rates and has put existing information systems to the test. For this paper I decided to look at two new software projects that have emerged over the last few years: Cassandra and Hamachi. Cassandra is a distributed database that evolved from the needs of websites such as Twitter and Facebook, whose need for frequent and short updates was too taxing on standard RDBMSs. Hamachi, which is in part a commercial project now, started as an open-source project aimed at creating zero-config VPNs. The project is now part of LogMeIn but is still free to use for non-commercial purposes.

In normal implementation circumstances you would not find these two projects paired together, but the concept of having a distributed database over a reasonably secure connection was one I thought was worth exploring. Over the course of this paper I'll discuss both projects in a reasonable amount of detail before detailing my implementation experiment and experience.

Cassandra

Cassandra is a distributed, decentralized, scalable and fault tolerant database system (or at least it claims to be) currently under development by the Apache foundation. It shares some features from both the Bigtable and Dynamo projects, but has more focus on decentralization. It sports a tuneable consistency and replication system which, together with a flexible schema system, can quickly adapt to a site or project's growing needs. Well, enough salesman talk, let's get to the details.

Over the course of this subsection I'll be covering some of the basic concepts of Cassandra and how they contribute to the "bigger picture", as well as some implementation and installation details.

The Data Model

Cassandra Column Family ((Source #1))

Cassandra takes a different approach compared to RDBMSs when it comes to how data is conceptualized and managed. In a standard RDBMS, data is normalized into a series of tables with a set number of columns. For the most part, the types and number of these columns will only rarely change as the need arises. For a later analogy, we'll refer to this model as a "Narrow Row" model. Cassandra operates on a system of Column Families which are analogous to a tables in an RDBMS, but is much more flexible. Column Families (hereafter referred to as CFs) are a loose grouping of common data keyed by some value for each row in question and are grouped together into Keyspaces. However, as opposed to the static feel of a RDBMS table, the number of columns or even the types of columns can and will change under Cassandra. That isn't to say there is no form of schema in Cassandra, because there is a simple type-enforcement schema system in place for each CF. To complete the analogy, the model for Cassandra can be seen as "Wide Row", with the number of columns growing however you please. Different rows in the same CF can even have a different number of columns. Internally, Cassandra uses a "always appending" model when writing data to disk, similar to Google's Bigtable or GFS.

The downside to such a data model is that Cassandra has no actual query language. There is no equivalent to SQL in Cassandra. There is also no support for referential integrity, either. With such circumstances maintaining data normalization while keeping processing time to a minimum is extremely difficult and is actively frowned upon. That's right, data duplication is encouraged when working with Cassandra. Luckily, the latest version (0.7.0 as of this writing) has introduced indexes to the meta-data that can be assigned to a column. This allows for basic boolean logic queries can be constructed to filter results better.

Scaling

Cassandra Ring status

While Cassandra can operate on a single machine the power lies in setting up a Cassandra cluster. Each instance is referred to as a node within the cluster. In Cassandra nomenclature the nodes form a ring and divide the keyspace amongst all of the nodes based on a specified partitioner. There are a number of partitioners available, including some that take physical location into account.

Fault Tolerance

Cassandra employs two protocols that make up it's fault tolerance system, Gossip and Handoffs. Gossip is an extension of the usual Heartbeat model of monitoring other servers, and involves a full TCP style SYN/ACK/SYN+ACK process to verify the alive-ness of another node in the ring. When a node attempts to contact another and fails, it "Gossips" this fact to the others. Using a special accrual method for detecting downed nodes the ring can keep track of problem areas and adjust accordingly. The second half of this fault tolerance suite is the Handoff protocol. When a write destined for a downed node is received by the ring, other nearby nodes will accumulate and keep these entries until the downed node is restored. During startup a node will check if any outstanding handoff messages are destined for it and import them promptly.

Replication

One of the nice features of Cassandra is a built-in Replication system. When a Keyspace is created the replication factor dictates how many copies of column data will be spread out over the ring. When writing or reading data to the nodes the clients can specify the desired consistency level that is to be observed relative to the Keyspace's replication factor. Like most distributed databases, Cassandra purports eventual consistency.

Consistency

ONE consistency
The client will block until the first node available responds to the request.
QUORUM consistency
The client will block until a majority of nodes agree on the data to return.
ALL consistency
The client will block until all nodes would return the same data

I'll give an example of the impacts of the various replication levels later in my implementation report.

Other options

Cassandra shares some common ground with a number of other projects:

  • Bigtable
  • Dynamo
  • Riak
  • CouchDB

Hamachi

Hamachi is a project aimed at easing the complication of creating and managing a Virtual Private Network. VPNs have historically been used to join external machines onto private networks. A travelling salesman or branch office might use one to gain access to a company's internal resources over possibly insecure lines. With a properly configured VPN encrypted traffic is tunneled over an insecure connection without concern that the contents could be inspected enroute. Hamachi has its origins in the modern playing of classic-era games that don't support WAN based gameplay. By creating a VPN with friends the game would be unable to tell that the machines were not all physically together.

Caveat

While the original implementation of Hamachi was open source the project is currently part of a commercial vendor called LogMeIn. The reason I chose to use this project was simply to fulfill a need for a quick VPN setup with a minimal amount of fuss. The performance and security of my implementation project would undoubtably be improved by an open source and mediator-free VPN solution such as [[1]].

References

  1. Cassandra: The definitive guide by Eben Hewitt © 2011