CockroachDB review: A scale-out SQL database built for survival
CockroachDB is a distributed SQL database built on top of a transactional and consistent key-value store that can withstand datacenter failures
CockroachDB v1.1.3
Until very recently, when you shopped for a database you had to choose: Scalability or consistency? SQL databases such as MySQL guarantee strong consistency, but don’t scale well horizontally. (Manual sharding for scalability is no one’s idea of fun.) NoSQL databases such as MongoDB scale beautifully, but offer only eventual consistency. (“Wait long enough, and you can read the right answer”—which isn’t any way to do financial transactions.)
Google Cloud Spanner, a fully managed relational database service running on Google Compute Engine (GCE) and released in February 2017, has the scalability of NoSQL databases while retaining SQL compatibility, relational schemas, ACID transactions, and strong external consistency. Spanner is a sharded, globally distributed and replicated relational database that uses the Paxos algorithm to reach consensus among its nodes.
One alternative to Spanner, and the subject of this review, is CockroachDB, an open source, horizontally scalable distributed SQL database developed by ex-Googlers who were familiar with Spanner. CockroachDB borrows from Google’s Spanner for the design of its data storage system, and it uses the Raft algorithm to reach consensus among its nodes.
Like Cloud Spanner, CockroachDB is a distributed SQL database built on top of a transactional and consistent key-value store, in CockroachDB’s case on RocksDB. CockroachDB’s primary design goals are support for ACID transactions, horizontal scalability, and (most of all) survivability, hence the name.
CockroachDB is designed to survive disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Of course, to accomplish that you need to run a cluster of many instances of CockroachDB’s symmetric nodes, using multiple disks, machines, racks, and datacenters.
Unlike Cloud Spanner, which uses the TrueTime API available for time synchronization in Google data centers, CockroachDB can’t count on the presence of atomic clocks and GPS satellite clocks to synchronize the time accurately across nodes and data centers. That has a number of implications. To begin with, Google TrueTime gives an upper bound of seven milliseconds on clock offsets between the nodes in a cluster. That’s small enough that a Spanner node simply waits seven milliseconds after a write before reporting that a transaction has committed, guaranteeing external consistency.
Without TrueTime or a similar facility, CockroachDB must fall back to NTP, which typically bounds clock offsets between nodes somewhere between 100 and 250 milliseconds. Given that larger time window, CockroachDB doesn’t wait after writes. Instead it sometimes waits before reads, restarting a transaction if it reads a value with a timestamp later than the transaction’s start time, again to guarantee consistency.
When all the nodes in a CockroachDB cluster have the small upper bounds for clock offsets that you can get from GPS or atomic clocks (something just becoming available on major public clouds), it can make sense to run them with the --linearizable flag. That makes them wait for the maximum clock offset before returning a successful commit, just like Spanner.
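As a rough sketch, starting a node this way might look like the following. The host names are illustrative, the flag must be passed to every node, and the seven-millisecond --max-offset assumes GPS or atomic clock synchronization; with ordinary NTP this setting would be far too aggressive.

```shell
# Start a node that waits out the maximum clock offset before
# acknowledging commits (Spanner-style). --max-offset must be
# identical on every node in the cluster.
cockroach start --certs-dir=certs \
  --host=node1.example.com \
  --max-offset=7ms \
  --linearizable \
  --join=node1.example.com:26257,node2.example.com:26257,node3.example.com:26257
```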
How CockroachDB works
Each CockroachDB node consists of five layers:
- SQL, which translates client SQL queries to key-value operations
- Transaction, which allows atomic changes to multiple key-value entries
- Distribution, which presents replicated key-value ranges as a single entity
- Replication, which consistently and synchronously replicates key-value ranges across many nodes, and enables consistent reads via leases
- Storage, which writes and reads key-value data on disk
The SQL layer parses queries against a Yacc file and turns them into an abstract syntax tree. From the abstract syntax tree, CockroachDB generates a tree of plan nodes, which contain key-value code. The plan nodes are then executed, communicating with the transaction layer.
The transaction layer implements ACID transaction semantics with two-phase commits across the entire cluster, including cross-range and cross-table transactions, and treats single statements as transactions (also called auto-commit mode). The two-phase commit is accomplished by posting transaction records and write intents, executing read operations, and treating any write intents encountered as transaction conflicts.
The distribution layer receives requests from the transaction layer on the same node. It then identifies which nodes should receive the request, and sends the request to the proper node’s replication layer.
The replication layer copies data between nodes and ensures consistency between these copies by implementing the Raft consensus algorithm. You can control the replication factor at the cluster, database, and table level using replication zones. The consensus algorithm is used only for writes, and requires that a quorum of replicas agrees on any changes to a range before those changes are committed.
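Replication zones are managed with the cockroach zone subcommand. A minimal sketch, assuming a secure cluster with certificates in a local certs directory:

```shell
# Raise the cluster-wide default replication factor from 3 to 5 by
# piping a zone configuration to the zone subcommand:
echo 'num_replicas: 5' | cockroach zone set .default --certs-dir=certs -f -

# Inspect the effective default zone configuration:
cockroach zone get .default --certs-dir=certs
```

The same `zone set` command accepts a database or table name (for example, `mydb.mytable`) in place of `.default` to override replication at a finer granularity.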
The storage layer stores data as key-value pairs on disk using RocksDB. CockroachDB relies heavily on multi-version concurrency control (MVCC) to process concurrent requests and guarantee consistency. Much of this work is done by using hybrid logical clock (HLC) timestamps.
Like Spanner, CockroachDB supports time travel queries (enabled by MVCC). These can go back as far as your most recent database garbage collection, done by default on a daily basis.
CockroachDB installation and use
CockroachDB runs on Mac, Linux, and Windows operating systems, at least for development and test. Production Cockroach databases usually run on Linux VMs or orchestrated containers, often hosted on public clouds in multiple datacenters. The Windows binary of CockroachDB is still in a beta phase and not recommended for production, and Apple no longer makes server hardware.
As you can see in the screenshot above, there are four options for installing CockroachDB on a Mac. I chose Homebrew as likely the easiest of the four.
By the way, Cockroach Labs posts a warning on the site that says running a stateful application such as CockroachDB in Docker is tricky, not recommended for production deployments, and to use an orchestration tool like Kubernetes or Docker Swarm to run a cluster instead. I’ll discuss container orchestration options in the next section.
Installation on a Mac is as easy as brew install cockroach, as shown above. In my case, the automatic update of Homebrew took far longer (enough time to brew tea) than the actual CockroachDB installation, which took only a few seconds.
Once you have CockroachDB installed, it’s fairly easy to spin up a local cluster, although there is the extra step of generating security certificates with the cockroach cert command if you want the cluster to be secure. Once you have a cluster running (using the cockroach start command once for each node, with subsequent nodes set to join the first node’s cluster), you can connect to the web administration page on any node with a browser, and use the embedded cockroach sql client to send SQL queries to any node. Most browsers will complain about sites with CockroachDB-generated certificates, but you should be able to click through that dire warning and continue to the site.
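The steps above can be sketched as a secure three-node cluster on a single machine. Paths, ports, and store names here are illustrative, not prescriptive:

```shell
# Generate a CA plus node and client certificates:
mkdir certs my-safe-directory
cockroach cert create-ca --certs-dir=certs --ca-key=my-safe-directory/ca.key
cockroach cert create-node localhost --certs-dir=certs --ca-key=my-safe-directory/ca.key
cockroach cert create-client root --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Start the first node, then join two more to it (each node needs its
# own store directory and ports when they share a machine):
cockroach start --certs-dir=certs --store=node1 --port=26257 --http-port=8080 --background
cockroach start --certs-dir=certs --store=node2 --port=26258 --http-port=8081 \
  --join=localhost:26257 --background
cockroach start --certs-dir=certs --store=node3 --port=26259 --http-port=8082 \
  --join=localhost:26257 --background

# Connect with the built-in SQL client:
cockroach sql --certs-dir=certs
```

On a real multi-machine deployment you would drop the per-node port overrides and point --join at the first node’s address.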
The recommended CockroachDB production settings are different from the defaults, which were set up for development and test instances. You can develop on a one-node cluster if you wish. For production, you should have a minimum of three nodes, run each node on a separate machine, VM, or container, and give each instance extra cache and SQL memory. The default settings are 128 MB each for cache and SQL memory; the recommended production settings give each 25 percent of RAM:
cockroach start --cache=25% --max-sql-memory=25% <other start flags>
The more nodes you run, the better the resiliency will be. The bigger and faster the nodes, the better the performance. If you want to have nodes with performance roughly comparable to Google Cloud Spanner nodes, which deliver 2,000 writes per second and 10,000 reads per second, then you would want something like GCE’s n1-highcpu-8 instances, which have eight CPUs and 8 GB RAM, with local SSD disks (rather than spinning disks).
The more you distribute your nodes to different datacenters, the better you can ensure immunity to datacenter-level failures. There is a cost, however: The round trip latency between datacenters will have a direct effect on your database’s performance, with cross-continent clusters performing noticeably worse than clusters in which all nodes are geographically close together.
Cockroach Labs supplies detailed instructions for deployment on AWS, Digital Ocean, GCE, and Azure. The recommended configurations use load balancers, either the native managed load balancing services or open source load balancers such as HAProxy.
Orchestration can lower the operating overhead of a CockroachDB cluster to almost nothing. Cockroach Labs documents how to do this for production with Kubernetes and Docker Swarm. The CockroachDB-CloudFormation repository on GitHub shows how to use AWS CloudFormation and Kubernetes in a single availability zone for development and test. Adapting this for production would involve modifying the CloudFormation template to use multiple availability zones.
CockroachDB programming and testing
CockroachDB supports the PostgreSQL wire protocol, so you write your code as if you were programming against Postgres, or at least a subset of Postgres. This page lists the tested drivers for various programming language bindings, including most popular server-side languages. This page lists samples in 10 programming languages and five ORMs. I didn’t encounter any big surprises when I read through the code, although I did spot a few probable minor bugs in the listings within the documentation. You can also run your SQL using the interactive client built into the cockroach executable.
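The same SQL a Postgres driver would send works in the built-in client. A small sketch (the database and table names are illustrative, and certs/ assumes a secure local cluster):

```shell
# Create a database and table, insert a row, and read it back,
# all through the built-in SQL client:
cockroach sql --certs-dir=certs -e "
CREATE DATABASE IF NOT EXISTS bank;
CREATE TABLE IF NOT EXISTS bank.accounts (id INT PRIMARY KEY, balance DECIMAL);
INSERT INTO bank.accounts VALUES (1, 1000.50);
SELECT id, balance FROM bank.accounts;"
```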
While there is a repo dedicated to CockroachDB load generators and another for performance tests, benchmarking CockroachDB clusters is not easy, especially if you’re trying to compare CockroachDB to other databases in a meaningful way. One issue is that the network among the nodes can be the rate-limiting step in CockroachDB clusters.
Another fact to take into consideration is that most conventional SQL databases do not run in SERIALIZABLE isolation mode by default; instead they use a less strict mode with better performance. CockroachDB uses serializable isolation mode by default. Additionally, it would be a bit unfair to test CockroachDB’s SQL join performance, which is still a work in progress, with the TPC-C suite.
And yet you can easily see the operational power of CockroachDB. For example, many databases need to be stopped and restarted to scale them up. Adding nodes under load in CockroachDB is a breeze, especially if you’re using an orchestration tool. For example, the screenshot above shows the commands to change and display the nodes in a Kubernetes cluster, and the screenshot below shows the monitored cluster as the nodes are added. A load generation tool ran continuously throughout the process.
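With the StatefulSet from Cockroach Labs’ Kubernetes configuration, adding a node under load comes down to a single command. The StatefulSet name below assumes their example manifests:

```shell
# Grow the cluster from three nodes to four; the new pod joins
# the cluster and starts receiving replicas automatically:
kubectl scale statefulset cockroachdb --replicas=4

# Watch the new pod come up alongside the existing nodes:
kubectl get pods
```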
An even more impressive demonstration shows automatic cross-cloud migration within a CockroachDB cluster. It really requires video to do it justice; the video is hosted in the linked blog post.
CockroachDB SQL
SQL in CockroachDB is more or less standard, unlike SQL in Cloud Spanner, which uses non-standard syntax for data manipulation. CockroachDB SQL is still missing many features, however.
For example, V1.1 lacks JSON support, which is planned for V1.2. It also lacks XML parsing, which is not on the roadmap. It lacks row-level cascades, planned for V1.2, and lacks cursors and triggers, which are not on the roadmap. Geospatial indexes are “potential” additions that may make it to the roadmap in the future.
Most notably, the initial CockroachDB implementation of SQL joins in 2016 was deliberately simplistic and exhibited quadratic scaling, making it useless for querying large datasets. The version in V1.0, done by a co-op student, implemented hash joins, making many join operations scale linearly; that brought CockroachDB roughly to the level of SQLite. Sometime in 2018, given a recent round of funding, CockroachDB should have join performance that scales more like PostgreSQL’s, as well as SQL join processing distributed over the cluster.
Like Google Cloud Spanner, Oracle Database, and Microsoft SQL Server, CockroachDB supports historical data queries, also known as time travel queries. These can only go as far back as the most recent garbage collection, however.
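Time travel queries use the AS OF SYSTEM TIME clause. A sketch, reusing a hypothetical bank.accounts table; the timestamp is illustrative and must fall after the most recent garbage collection:

```shell
# Read the table's contents as they stood at a past timestamp:
cockroach sql --certs-dir=certs \
  -e "SELECT id, balance FROM bank.accounts AS OF SYSTEM TIME '2017-11-01 10:00:00'"
```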