A few years back, as other NoSQL players such as MongoDB exploded on the scene, Apache Cassandra's star was fading. The company that developed it, Facebook, had dumped it. Its community seemed to be lagging. But Cassandra's fortunes changed when Netflix decided to move from Oracle in its own data centers to Cassandra in the Amazon cloud. Not long after, Facebook picked up Cassandra again, this time to power its Instagram acquisition. Reddit, Twitter, and WebEx all use Cassandra in some capacity.
One of Cassandra's challenges is that it's almost in a category of its own. Like HBase, it's a column-family database, but it also has fundamental differences. It's a different species altogether than NoSQL document databases such as MongoDB or Couchbase, or key-value-pair databases such as DynamoDB, Redis, or Riak.
[ Andrew C. Oliver answers the question on everyone's mind: Which freaking database should I use? | Also on InfoWorld: The time for NoSQL standards is now | Get a digest of the key stories each day in the InfoWorld Daily newsletter. ]
What's a column-family database anyhow?
A column-family database stores data as row keys and sets of tuples. This looks very much like a table, and often people familiar with relational databases are led down a path of misunderstanding, especially since Google's famous column-family implementation is called BigTable. Don't be fooled: Column-family database "tables" do not act like RDBMS tables.
For one thing, they are basically schema-less, in that you can easily add another column at will. Unlike RDBMS tables that are somewhat optimized for storage and need to know a reasonable size for each row, column families are optimized for distribution. That rowkey serves more than convenient lookup; it also allows the system to effectively shard the data around the cluster. See below for the structure of a standard column family if expressed as JSON (names can be arbitrary).
ColumnFamilyName {
rowkey1 = {column1:"value1", column2:"value2"},
rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
rowkey3 = {column2:"value3.2", column3:"value3.3"},
rowkey4 = {column1:"value4.1", column3:"value4.3"}
}
To compound matters, this is a simple or "standard" column family. It is fairly limited by itself. Many column family databases, including Cassandra, include the concept of a supercolumn and a supercolumn family. Structurally, it is a bit like turducken in turducken. The example below shows a supercolumn family containing column families.
SuperColumnFamily {
ColumnFamily1 {
cf1rowkey1 = {column1:"value1", column2:"value2"},
cf1rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
cf1rowkey3 = {column2:"value3.2", column3:"value3.3"},
cf1rowkey4 = {column1:"value4.1", column3:"value4.3"}
},
ColumnFamily2 {
cf2rowkey1 = {column1:"value1", column2:"value2"},
cf2rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
cf2rowkey3 = {column2:"value3.2", column3:"value3.3"},
cf2rowkey4 = {column1:"value4.1", column3:"value4.3"}
}
}
Moreover, you'll find other structures such as composite columns and other distinctions, including static and dynamic. There are also special column types like counters and timestamps.