The modern sense of NoSQL, which dates from 2009, refers to databases that are not built on relational tables, unlike SQL databases. Often, NoSQL databases boast better design flexibility, horizontal scalability, and higher availability than traditional SQL databases, sometimes at the cost of weaker consistency.
NoSQL databases can take a number of forms. They can be cloud services or install on-premises. They can support one or more data models: key-value, document, column, graph, and sometimes even relational—which is one reason that NoSQL is sometimes parsed as “Not Only SQL.” They can also support a range of consistency models, from strong consistency to eventual consistency.
Key-value is the most basic of the four non-relational data models. Sometimes other database models are implemented on top of a key-value foundation layer.
Column databases have keys, values, and timestamps; the timestamp is used for determining the valid content. Cassandra is a prominent example of a column database.
Document stores, such as MongoDB, have a query language or API for finding documents by content. They also have key lookups, like key-value stores.
Graph databases, such as Neo4j, explicitly express the connections between nodes. This makes them more efficient at the analysis of networks (computer, human, geographic, or otherwise) than relational databases. (I’ll cover graph databases in a future article.)
A few databases expose multiple data models. Some, such as Azure Cosmos DB, isolate the data models from each other. Others, such as FaunaDB, combine the data models.
Some of these databases support globally distributed data and possibly automatic sharding. For example, Amazon DocumentDB replicates six copies of your data across three AWS Availability Zones and allows for up to 15 read replicas. Amazon DynamoDB supports multiple regions and global tables.
Azure Cosmos DB is globally distributed and horizontally partitioned. YugaByte DB not only was designed for planet-scale applications, but also supports multi-cloud clusters, automatic sharding and rebalancing, and distributed ACID transactions.
Aerospike
Aerospike is a distributed, scalable, strongly consistent, schema-less key-value database for real-time big data operations. Data is structured in namespaces (the equivalent of an RDBMS database) and bins (the equivalent of RDBMS columns). Each bin supports certain data types: integer, string, float, list, map, geojson, binary objects, or language-serialized objects.
Aerospike keeps its primary and secondary indexes in RAM and its data either in RAM or on SSDs. You can back up Aerospike databases to hard disks. Aerospike is offered in community and enterprise editions.
The Aerospike “shared nothing” architecture is designed to reliably store terabytes of data with automatic fail-over, replication, and (in the enterprise version) cross data-center synchronization. This layer scales linearly.
Aerospike uses a Paxos-based algorithm to determine which nodes are considered part of the cluster. Clusters re-form any time a node is added or removed. Each node uses a distributed hash algorithm to divide the primary index space into data slices and assign owners. The Aerospike Data Migration module balances data distribution across all nodes in the cluster.
Aerospike Query Language (AQL) is a command-line utility with SQL-like syntax. You can also query Aerospike from more than 10 programming languages using its API.
Read InfoWorld’s review of Aerospike.
Amazon DocumentDB
Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. Amazon DocumentDB is designed from the ground up to give you the performance, scalability, and availability you need when operating mission-critical MongoDB workloads at scale.
Amazon DocumentDB implements the Apache 2.0 open source MongoDB 3.6 API by emulating the responses that a MongoDB client expects from a MongoDB server, allowing you to use your existing MongoDB drivers and tools with DocumentDB. The database service in the Amazon cloud also uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64 TB per database cluster.
In Amazon DocumentDB, the storage and compute are decoupled, allowing each to scale independently. Developers can increase the read capacity to millions of requests per second by adding up to 15 low latency read replicas in minutes, regardless of the size of the data. Amazon DocumentDB is designed for 99.99% availability and replicates six copies of your data across three AWS Availability Zones.
Amazon DynamoDB
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-region, multi-master database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB global tables replicate your data across multiple AWS Regions to give you fast, local access to data for your globally distributed applications. For use cases that require even faster access with microsecond latency, DynamoDB Accelerator (DAX) provides a fully managed in-memory cache.
DynamoDB automatically scales tables up and down to adjust for capacity and maintain performance. Availability and fault tolerance are built-in.
Azure Cosmos DB
Azure Cosmos DB is a globally distributed, horizontally partitioned, multi-model database service. It offers four data models (key-value, column family, document, and graph) and five tunable consistency levels (strong, bounded staleness, session, consistent prefix, and eventual consistency).
The consistency levels below strong offer higher performance in exchange for weaker consistency. Roughly 70% of Cosmos DB tenants choose session consistency, which incurs only half the cost per read of strong consistency.
Cosmos DB offers five API sets: SQL (dialect), MongoDB-compatible, Azure Table-compatible, graph (Gremlin), and Apache Cassandra-compatible. Cosmos DB automatically indexes all data without requiring you to deal with schema and index management.
The design goals of Cosmos DB include elastic global scalability, low-cost operation, low read and write latencies, 99.99 percent availability, predictable and tunable data consistency, stringent financially backed comprehensive SLAs, automatic schema/index management and versioning, native support for multiple data models, and popular APIs for accessing data.
At the lowest level, Cosmos DB has a schema-agnostic, atom-record-sequence (ARS)-based database engine implemented on top of Azure Service Fabric. The four application data models are all projected onto the ARS-based core model.
As you might expect, the five database API sets don’t all map onto every data model. The SQL API used to be called the DocumentDB API; it applies to JSON document databases. The MongoDB API is also for document databases. Note that the wire protocol is different for the MongoDB API and the SQL API, so that one account can’t be used for both APIs, even though the document data can be migrated between the two. The Gremlin API is for property graph databases, the Azure Table API is for key-value tables, and the Cassandra API is for wide-column (column family) databases.
Microsoft has used Cosmos DB for internal applications for some time. For example, Azure Portal uses Cosmos DB as its global transactional store, as do Xbox and Skype.
Read my review of Azure Cosmos DB.
Cassandra and DataStax
Apache Cassandra is a highly available distributed data store that values availability and partition tolerance over consistency. The design of Cassandra combines the partitioning and replication of the Amazon Dynamo key-value store with the log-structured ColumnFamily data model of Google Bigtable. Cassandra scales linearly as you add nodes.
Consistency is not completely lost in Cassandra; it’s a tradeoff against latency. The user can specify the consistency level of each read and write, ranging from requiring only one node, through requiring a cluster quorum, to requiring all nodes. Another intermediate option is to require a local quorum, which is a way to attain consistency within a data center without waiting for remote nodes to update.
DataStax adds both features and performance to Cassandra. Among other improvements, DataStax eliminates the need to run repair scripts and eliminates the cluster outages that can occur when manual repairs fail; automatically keeps DataStax Enterprise nodes from overloading with client or replica requests; and uses a thread-per-core architecture that improves throughput up to 2x for both read and write operations.
Read my review of DataStax Enterprise.
Couchbase Server
Couchbase Server is a memory-first, distributed, flexible JSON document database that is strongly consistent within a local cluster. Couchbase Lite is a mobile version that can run locally and also synch to the server when connected. Couchbase Server scales both vertically and horizontally; in the Enterprise Edition you can scale different services independently for maximum performance, varying the number and size of nodes for each service, such as the data, index, query, and full-text search services.
Asynchronous operations help Couchbase to avoid blocking writes, reads, or queries. The developer can balance durability and consistency against latency when needed.
The Couchbase JSON data model supports both basic and complex data types: numbers, strings, nested objects, and arrays. You can create documents that are normalized or denormalized. Couchbase does not require or even support schemas.
You can access Couchbase documents through four mechanisms: key-value, SQL-based queries, full-text search, and JavaScript eventing. If your JSON documents have subdocuments or arrays, you can access them directly using path expressions without needing to transfer and parse the whole document.
You can install Couchbase Server on premises, in the cloud, and on Kubernetes. Couchbase Server Enterprise Edition is free for development and testing, and available by subscription for production. The open source Couchbase Server Community Edition is free for all purposes. Aside from some omitted features, Community Edition is API-compatible with Enterprise Edition.
With Cross Data Center Replication (XDCR), Couchbase Server does asynchronous active-active replication across clusters, data centers, and availability zones, to avoid incurring high write latencies. XDCR allows Couchbase to be a globally distributed database, at the cost of allowing eventual (rather than strong) consistency between clusters.
The Couchbase Server query language, N1QL (pronounced “nickel”), looks very much like standard SQL, with extensions for JSON, such as key and hash hints for joins and an IS MISSING
condition to handle values that have been completely omitted from a document. Couchbase offers SDKs for eight programming languages and three frameworks.
Read my review of Couchbase Server.
CouchDB
Apache CouchDB is an open-source document model database with a query engine, replication, and conflict resolution. It uses a RESTful HTTP API for queries as well as updates. CouchDB is implemented in Erlang.
The CouchDB file layout and commitment system feature all ACID properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state. This is a “crash-only” design where the CouchDB server does not go through a shutdown process; rather, it’s simply terminated.
CouchDB read operations use a Multi-Version Concurrency Control (MVCC) model where each client sees a consistent snapshot of the database from the beginning to the end of the read operation. Documents are indexed in B-trees by their name (DocID) and a Sequence ID.
CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected. Those changes can then be replicated bi-directionally later.
CouchDB allows for any number of conflicting documents to exist simultaneously in the database, with each database instance deterministically deciding which document is the “winner” and which are conflicts. When distributed edit conflicts occur, every database replica sees the same winning revision and each has the opportunity to resolve the conflict.
FaunaDB
FaunaDB is a distributed, strongly consistent OLTP NoSQL database that is ACID compliant and offers a multi-model interface. It has an active-active architecture and can span clouds as well as continents.
FaunaDB supports document, relational, graph, and temporal datasets from a single query. In addition to its own FQL query language, the company has announced support for GraphQL now, plus CQL and SQL in the future.
Google Cloud Bigtable
Cloud Bigtable is a public, highly scalable (up to petabytes), column-oriented NoSQL database as a service that uses the same code as Google’s internal version, which Google invented in the early 2000s and published a paper about in 2006. Bigtable was and is the underlying database for many Google services including Google Search, Google Analytics, Google Maps, and Gmail.