The best graph databases

These stellar databases combine horizontal scalability with highly efficient engines for storing and analyzing connected data

Contributor, InfoWorld

Graph databases, which explicitly express the connections between nodes, are more efficient at the analysis of networks (computer, human, geographic, or otherwise) than relational databases. That gives graph databases a leg up for applications such as fraud detection and recommendation systems.

One of the major draws of graph databases is the ability to run graph computational algorithms. These are used for tasks that don’t lend themselves well to relational databases, such as graph search, pathfinding, centrality, PageRank, and community detection. Graph algorithms are mostly supported in analytical (OLAP and HTAP) graph databases, although some transactional (OLTP) graph databases such as Neo4j support them.
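
To make one of these algorithms concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is a toy illustration, not how any of the databases below implement it; the `pagerank` function, the damping factor of 0.85, and the four-page `links` graph are assumptions chosen for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of out-neighbors.

    Every neighbor must also appear as a key in the dict.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with its "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                # Split this node's damped rank evenly among its out-links.
                share = damping * rank[node] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph (illustrative data only).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph, page `c` collects links from three other pages and ends up with the highest score, while `d`, which nothing links to, ends up with the lowest. Expressing the same iteration as relational joins is possible but awkward, which is the point the paragraph above makes.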

All of the graph databases discussed here have good horizontal scalability. Some also support read replicas, global distribution, and automatic horizontal sharding.

Amazon Neptune

Amazon Neptune is a fully managed transactional (OLTP) graph database service with ACID properties and immediate consistency. At its core is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports two of the most popular open source graph query languages, Apache TinkerPop Gremlin and W3C SPARQL.

Neptune database clusters can have up to 64 TB of auto-scaling storage, with six replicas of your data spread across three availability zones, and more if you enable high availability by using read replicas in additional zones. Neptune automatically detects database crashes and restarts, typically in 30 seconds or less, without needing to perform crash recovery or rebuild the database cache, since the cache is isolated from the database processes and can survive a restart. If an entire primary instance fails, Neptune will automatically fail over to one of up to 15 read replicas. Backups are continuously streamed to Amazon S3.

You can scale Neptune clusters up and down either by modifying instances or, to avoid downtime, by adding an instance of the desired size, promoting it to primary once a copy of the data has migrated, and then shutting down the old instance. Neptune VM instance sizes range from db.r4.large (two vCPUs and 16 GiB of RAM) to db.r4.8xlarge (32 vCPUs and 244 GiB of RAM), giving Neptune a 16x dynamic range for writes and a 256x dynamic range for reads (counting the read replicas).
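
The 16x and 256x figures can be reproduced with a few lines of arithmetic. The sketch below is one reading of how they are derived, and it rests on two assumptions: that vCPU count is the measure of capacity, and that the primary serves reads alongside the 15 read replicas mentioned earlier.

```python
# Figures quoted in the text; treating vCPUs as the capacity measure is an assumption.
smallest_vcpus = 2     # db.r4.large
largest_vcpus = 32     # db.r4.8xlarge
max_read_replicas = 15

# Writes go to the single primary, so the range is largest / smallest instance.
write_range = largest_vcpus // smallest_vcpus

# Reads can be served by the primary plus every read replica (assumed here).
max_read_instances = 1 + max_read_replicas
read_range = write_range * max_read_instances

print(f"write range: {write_range}x, read range: {read_range}x")
# prints "write range: 16x, read range: 256x"
```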

Read my review of Amazon Neptune.

AnzoGraph

AnzoGraph is a massively parallel, in-memory OLAP graph database that works with enterprise data sources and does parallel data loads of RDF and CSV formats. AnzoGraph can be deployed in single-node sandboxes, or in clusters with as many nodes as needed for production. AnzoGraph has ACID transaction properties.

AnzoGraph uses W3C-standard RDF triple and quad data and SPARQL 1.1 queries. It supports labeled property graphs as part of the RDF store, conforming to the proposed RDF* and SPARQL* standards, and it has extensions to SPARQL to support graph algorithms, inferencing, window aggregates, BI functions, and named views. Support for the Neo4j-compatible OpenCypher language and the Neo4j protocol Bolt is planned.
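
For readers new to RDF, the data model is simply a set of subject-predicate-object triples, and a query is a pattern with variables in some positions. The toy Python sketch below illustrates that idea only; it is not AnzoGraph's API or a SPARQL engine, and the `match` helper and the sample triples are made up for the example.

```python
# Toy in-memory triple store: (subject, predicate, object) tuples.
# All names and data here are hypothetical.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Roughly what a single SPARQL triple pattern such as
# `?person :worksAt :acme` would select.
acme_staff = [s for (s, _, _) in match(triples, predicate="worksAt", obj="acme")]
print(acme_staff)  # prints ['alice', 'carol']
```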

AnzoGraph features high-performance graph query execution and scalability to billions and even trillions of triples, as well as fast parallel data loads that don’t require taking the database offline. AnzoGraph clusters can be deployed on CentOS, Kubernetes, and AWS. Google Cloud Platform and Azure deployments of AnzoGraph are usually treated as Kubernetes deployments. AnzoGraph has demonstrated scalability to 40 nodes in a synthetic benchmark.

Read my review of AnzoGraph.

Neo4j

Neo4j is a scalable OLTP graph database with some OLAP capabilities. Neo4j was the original graph database, first created in 1999, and continues to be a market leader.

While the open source Neo4j Community Edition is limited to a single server, the Neo4j Enterprise Edition allows you to add as many nodes to a cluster as you need for performance purposes.

Every node in a Neo4j high availability cluster contains the database and a cluster management component, and the cluster can be accessed through a load balancer. The full graph is replicated to each instance of the cluster, and the read capacity of each HA cluster increases linearly with the number of server instances. Neo4j can commit tens of thousands of writes per second while maintaining fully ACID transactions.

In a Neo4j causal cluster, a core cluster of read-write servers is combined with one or more asynchronously updated clusters of read replicas. Any application is guaranteed causal consistency, meaning that it is guaranteed to read at least its own writes, even when hardware and networks fail. The read replicas in a causal cluster may be geographically distributed to improve query performance for users near the replicas.
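
The read-your-own-writes guarantee can be illustrated with a toy model in which each write returns a token and a replica answers a read only after it has applied the transaction that token names. This is a sketch of the general idea, not Neo4j's implementation; the `Core`, `Replica`, and `causal_read` names are hypothetical, and replication is triggered explicitly rather than running in the background.

```python
class Replica:
    """Toy read replica that applies committed transactions in order."""

    def __init__(self):
        self.data = {}
        self.applied_tx = 0  # id of the last transaction applied here

    def apply(self, tx_id, key, value):
        self.data[key] = value
        self.applied_tx = tx_id


class Core:
    """Toy read-write server that keeps an ordered transaction log."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        tx_id = len(self.log) + 1
        self.log.append((tx_id, key, value))
        return tx_id  # the caller keeps this as a bookmark

    def replicate_to(self, replica):
        # Catch-up step; a real cluster would do this asynchronously.
        for tx_id, key, value in self.log[replica.applied_tx:]:
            replica.apply(tx_id, key, value)


def causal_read(replica, key, bookmark, core):
    """Read from a replica only once it has applied the bookmarked write."""
    if replica.applied_tx < bookmark:
        core.replicate_to(replica)  # stand-in for waiting on replication
    return replica.data.get(key)


core, replica = Core(), Replica()
bookmark = core.write("balance", 100)
stale = replica.data.get("balance")  # replica has not caught up yet
fresh = causal_read(replica, "balance", bookmark, core)  # sees the client's own write
print(stale, fresh)  # prints "None 100"
```

A plain read against the lagging replica returns nothing, while the read that carries the bookmark returns the value the client just wrote.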

Read my review of Neo4j.

TigerGraph

TigerGraph is a real-time, native parallel, HTAP graph database available for deployment in the cloud or on-premises. TigerGraph supports ACID properties, has built-in data compression, automatically partitions a graph within a cluster, and claims to be faster than the competition. It uses a message-passing architecture that is inherently parallel in a way that scales with the size of the data.

TigerGraph was designed to be able to perform deep link analytics as well as real-time online transaction processing and high-volume data loading. By “deep link analytics,” TigerGraph means following relationships from a vertex through the graph for three or more hops and analyzing the results.
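
A multi-hop traversal of this kind can be sketched as a breadth-first search with a hop limit. The Python below is a single-machine illustration of the concept, not TigerGraph's parallel engine or GSQL; the function name and the account-transfer graph are hypothetical.

```python
from collections import deque


def vertices_within_hops(graph, start, max_hops):
    """Breadth-first search returning {vertex: hop_count} for vertices within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(vertex, []):
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return hops


# Hypothetical account-transfer graph (illustrative data only).
transfers = {
    "acct1": ["acct2"],
    "acct2": ["acct3"],
    "acct3": ["acct4"],
    "acct4": ["acct5"],
}
reachable = vertices_within_hops(transfers, "acct1", 3)
print(reachable)
# prints {'acct1': 0, 'acct2': 1, 'acct3': 2, 'acct4': 3}
```

On a real graph the number of vertices reached tends to grow quickly with each additional hop, which is why traversals of three or more hops are treated as a separate workload.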

While several open source graph query languages, such as Cypher, Gremlin, and SPARQL, have been widely adopted, TigerGraph has a new query language, GSQL. GSQL combines SQL-like query syntax with Cypher-like graph navigation, plus procedural programming and user-defined functions. TigerGraph can convert Cypher to GSQL for people moving from a Neo4j database.

TigerGraph has a managed cloud offering that is currently in limited preview. TigerGraph has demonstrated a 6.7x speedup when running a read-write cluster with eight machines, but hasn’t said anything about read replicas or geographical distribution.
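
To put the 6.7x figure in context, dividing speedup by machine count gives the parallel efficiency, which works out to roughly 84 percent here. The calculation assumes the speedup is measured against a single machine, which the text does not state.

```python
speedup = 6.7   # figure quoted in the text
machines = 8    # cluster size quoted in the text

# Parallel efficiency: fraction of ideal linear scaling actually achieved.
# Assumes the baseline for the speedup is one machine.
efficiency = speedup / machines

print(f"parallel efficiency: {efficiency:.2%}")
```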

Read my review of TigerGraph.

Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi.