Review: Neo4j supercharges graph analytics

When it comes to tracking relationships, Neo4j is faster, more flexible, and more scalable than relational databases


Neo4j is both the original graph database and the continued leader in the graph database market. Designed to store entities and relationships, and optimized to perform graph operations such as traversals, clustering, and shortest-path calculations, Neo4j shines at exploring data that consists of huge numbers of many-to-many relationships.

To understand why graph databases are different and sometimes desirable, we need to go back in time. Early databases were little more than indexed file systems for sequential records (ISAM) with fixed field layouts. Soon databases started expressing hierarchical relationships, such as departments belonging to divisions. Then they captured networks using sets with 1:n relationships; you could traverse the sets programmatically. The standard for the network database model was first issued by CODASYL in 1969.

When relational databases were first introduced, in the early 1970s, they were roughly half as fast as CODASYL databases, because of the overhead of the SQL query processor, especially when joining related tables. Fortunately, computer hardware was becoming faster: Moore’s Law observed that the number of transistors on an integrated circuit was doubling roughly every two years. That observation held for decades.

Relational database limitations

Relational databases are still going strong, and still need powerful server hardware. For really big data under heavy use, however, relational queries tend to slow down, mostly because of large join tables, contention for indexes, and complicated join logic.

Relational databases are not well suited to capturing ad hoc relationships that are not consistent across all records: you wind up with sparsely populated rows and way too many indexes, both of which slow down database performance. Remember, the relational schema is fixed, so every record in a given table contains every field, whether or not the field is populated.

Standard, non-graph NoSQL databases — whether key-value, document-oriented, or column-oriented — typically store sets of disconnected values, documents, or columns. To connect them, you can embed an aggregate’s identifier inside a record belonging to another aggregate, but this isn’t very efficient. While there are many excellent use cases for all three kinds of NoSQL databases, connections aren’t their forte.

Enter the graph database

That brings us, finally, to graph databases and Neo4j. In 1999 Emil Eifrem and his colleagues at Neo Technology (now Neo4j, Inc.), needing to perform ad hoc analysis and frustrated by the limitations of relational databases, figured out a way to implement the 300-year-old mathematical graph model in a database with nodes as vertices and relationships as edges.

The Neo engineers created a labeled property graph in which nodes contain properties. Properties are key-value pairs, so the properties used in a given class of node may vary from one node to another. Nodes may have one or more labels. Relationships are named and directed (they always have a start and end node), and like nodes, relationships can also contain properties.
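To make the labeled property graph model concrete, here is a minimal Python sketch. It is only an illustration of the data model described above, not Neo4j's storage format or API; the class names, labels, and property values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    labels: Set[str]                                   # one or more labels
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    rel_type: str                                      # relationships are named...
    start: Node                                        # ...and directed: start -> end
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


# Two nodes that share a label but carry different property sets.
alice = Node({"Person", "Employee"}, {"name": "Alice", "title": "Sales Rep"})
bob = Node({"Person"}, {"name": "Bob"})  # no "title", and no empty column for it

# A relationship can carry properties of its own.
knows = Relationship("KNOWS", alice, bob, {"since": 2015})
```

Because properties live in a per-node map, a missing property simply is not stored, which is the contrast with the sparsely populated rows of a fixed relational schema.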

Neo Technology also created native graph storage and native graph processing using index-free adjacency rather than relying on a SQL back end. Neo4j complies with the ACID properties of transactional databases, has cluster support, and does runtime failover.
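The idea behind index-free adjacency can be sketched in a few lines of Python. This is a toy model under my own assumptions (the `GraphNode` class, the `FRIENDS_TABLE` rows, and the function names are hypothetical), not a description of how Neo4j lays out its store files. The point is only that each node holds direct references to its neighbors, so a hop does not consult a global table or index.

```python
# Relational style: one global "join table" of (from_id, to_id) rows.
FRIENDS_TABLE = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]


def neighbors_via_table(person):
    # Every hop consults the global table (here a full scan; with an index,
    # a lookup whose cost still depends on the size of the table).
    return [dst for src, dst in FRIENDS_TABLE if src == person]


# Graph style: each node record holds references to its own neighbors.
class GraphNode:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighboring GraphNode objects


def build(rows):
    nodes = {}
    for src, dst in rows:
        a = nodes.setdefault(src, GraphNode(src))
        b = nodes.setdefault(dst, GraphNode(dst))
        a.out.append(b)
    return nodes


def reachable(start, max_hops):
    """Names reachable from start within max_hops, following references only."""
    seen = {start.name}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in node.out:
                if neighbor.name not in seen:
                    seen.add(neighbor.name)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    seen.discard(start.name)
    return seen


nodes = build(FRIENDS_TABLE)
two_hops = reachable(nodes["ann"], 2)
```

The `nodes` dictionary is used once, to find the starting node; after that the traversal cost depends on how many relationships are followed rather than on the total size of the data.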

After 18 years of development, Neo4j is a mature graph database platform that you can run on Windows, macOS, and Linux, in Docker containers, in VMs, and in clusters. Neo4j can handle very large graphs, even in its open source edition, and unlimited graph sizes in its enterprise edition.

Graph database scalability

Does an arbitrarily large graph make sense? Graph databases don’t suffer from the same scaling issues as relational databases (one of which happens when queries use complex joins of large tables), so a very large graph database is still likely to perform well, at least once the relationships have been created. In the Neo4j Enterprise edition you can add as many cluster nodes as you need for performance purposes; the open source Community edition is limited to one server.

One computationally expensive operation is matching the related items in disjoint nodes, the rough equivalent of constructing foreign key constraints in a relational database. But in a graph database that cost is only incurred when you’re building the relationships (for example during data import), not when you’re using them. If you try to do this kind of match in the Neo4j Desktop, you’ll get a warning that says “This query builds a cartesian product between disconnected patterns. This may produce a large amount of data and slow down query processing.” You can still perform the operation, however — the warning is intended for when that wasn’t really what you meant.
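A rough Python sketch of why that match is expensive, and why the cost is paid only once: linking two disconnected sets of records by comparing every pair takes as many comparisons as the product of their sizes, while reading the stored links afterwards is a direct lookup. The record layouts and names here are hypothetical, and the code models the idea rather than Neo4j's query planner.

```python
from itertools import product

# Two disconnected sets of records that share a key (illustrative data).
customers = [{"customerID": f"C{i}"} for i in range(3)]
orders = [{"orderID": i, "customerID": f"C{i % 3}"} for i in range(6)]


def link_by_cartesian_product(customers, orders):
    """Compare every customer with every order, as a cartesian product does."""
    comparisons = 0
    links = []
    for customer, order in product(customers, orders):
        comparisons += 1
        if customer["customerID"] == order["customerID"]:
            links.append((customer["customerID"], order["orderID"]))
    return links, comparisons


# Paid once, at import time: 3 customers x 6 orders = 18 comparisons.
links, comparisons = link_by_cartesian_product(customers, orders)

# Afterwards the stored relationships are followed directly.
purchased = {}
for customer_id, order_id in links:
    purchased.setdefault(customer_id, []).append(order_id)
```

With thousands of records on each side the pairwise count grows into the millions, which is the situation the warning is pointing at.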

Neo4j installation and learning

To learn Neo4j, you should download and install the Neo4j Desktop and try some or all of the online Neo4j sandboxes, for example the Paradise Papers sandbox shown below. The Neo4j Desktop download includes the Neo4j Enterprise Edition for Developers, with a perpetual license. The sandboxes include data, interactive guides with example queries, and sample code. They expire three days after creation unless you extend them.

The Neo4j Desktop download has scripts to create a small movie database and to import the Microsoft Northwind sample database. There are sandboxes with data from the Panama Papers and Paradise Papers, the U.S. Congress, and others — including your own Twitter social graph, extracted from your account. You’ll learn a lot from going through all the samples and guides, although you’ll eventually want to read the documentation, especially for the Cypher query language.

In addition, the Neo4j Desktop has options to install the APOC (Awesome Procedures On Cypher) and graph algorithms libraries. APOC consists of about 300 Cypher procedures for various purposes, and the graph algorithms library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j 3.x, exposed as Cypher procedures.

[Figure: Neo4j sandboxes. Credit: IDG]

Neo4j sandboxes are cloud container instances preloaded with Neo4j and (optionally) graph databases of interest along with tutorial scripts. In the figure above, I have created an instance with the Paradise Papers data set.

Neo4j data import

As shown in the guide to importing data and ETL, Neo4j can import tables into nodes from CSV files, create indexes on the nodes, create uniqueness constraints, and construct relationships using the Cypher query language. If you wish, you can eliminate the intermediate CSV files by selecting tables or views from your relational database programmatically, using embedded SQL over its database driver, and adding the rows to the Neo4j database using embedded, parameterized Cypher statements. A beta component of Neo4j Enterprise, Graph ETL, provides a GUI for the data import and schema mapping process.
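The import steps described above (rows become nodes, a key enforces uniqueness, shared keys become relationships) can be sketched in plain Python. This is an in-memory illustration under my own assumptions, not the Neo4j import tooling: the `load_nodes` helper, the sample rows, and the `PART_OF` relationship name are hypothetical.

```python
import csv
import io

# Illustrative CSV extracts; real imports would read files instead.
PRODUCTS_CSV = """productID,productName,categoryID
1,Chai,1
2,Chang,1
48,Chocolade,3
"""

CATEGORIES_CSV = """categoryID,categoryName
1,Beverages
3,Confections
"""


def load_nodes(csv_text, label, key):
    """Turn each CSV row into a node, keyed (and kept unique) by one column."""
    nodes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[key] in nodes:
            # Stands in for a uniqueness constraint on the key property.
            raise ValueError(f"duplicate {label}.{key}: {row[key]}")
        nodes[row[key]] = {"label": label, "properties": dict(row)}
    return nodes


products = load_nodes(PRODUCTS_CSV, "Product", "productID")
categories = load_nodes(CATEGORIES_CSV, "Category", "categoryID")

# Build relationships once, by looking up the shared key in the category index.
part_of = [
    (product_id, "PART_OF", node["properties"]["categoryID"])
    for product_id, node in products.items()
    if node["properties"]["categoryID"] in categories
]
```

The dictionary plays two roles here, index and uniqueness check, which is roughly why creating indexes and constraints before constructing relationships keeps the linking step cheap.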

Cypher Query Language (for SQL programmers)

Cypher looks partly familiar and partly strange to an experienced SQL database programmer (see the SQL to Cypher guide). Some examples of syntax that are the same in both languages: WHERE, ORDER BY, SKIP LIMIT, AND, and p.unitPrice > 10. The syntaxes that are different in Cypher have to do with graphs, patterns, and relationships — all the aspects unique to graph databases.

For example, node patterns are expressed in parentheses: (variable:Label). Properties, as key-value pairs, go in curly brackets: (item:Product {productName:"Chocolade"}). For those SQL mavens playing along at home, yes, that example comes right out of the Northwind database.

Relationship patterns are expressed as arrows, with an optional variable and a relationship type in square brackets: (x)-[someRel:REL_TYPE]->(y). The arrowhead gives the direction of the relationship.
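As a way of reading the relationship pattern, the following Python sketch treats a graph as a list of (start, type, end) triples and returns the (x, y) bindings that satisfy a pattern of the form (x)-[:REL_TYPE]->(y). The `match` helper and the sample relationships are hypothetical; this is an analogy for pattern matching, not how Cypher is executed.

```python
# Directed relationships as (start, type, end) triples (illustrative data).
RELATIONSHIPS = [
    ("Nancy", "SOLD", "Order 1"),
    ("Nancy", "SOLD", "Order 2"),
    ("Andrew", "SOLD", "Order 3"),
    ("Nancy", "REPORTS_TO", "Andrew"),
]


def match(rel_type, start=None, end=None):
    """Return (x, y) bindings for (x)-[:rel_type]->(y).

    Leaving start or end as None leaves that side of the pattern unbound.
    """
    return [
        (s, e)
        for s, t, e in RELATIONSHIPS
        if t == rel_type and start in (None, s) and end in (None, e)
    ]


sold_by_nancy = match("SOLD", start="Nancy")
reports = match("REPORTS_TO")
wrong_direction = match("REPORTS_TO", start="Andrew")
```

The last call returns nothing because the pattern is directed: the stored relationship runs from Nancy to Andrew, not the other way around.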

The rough equivalent of the SQL SELECT statement in Cypher is the MATCH statement, and the RETURN clause defines the result. Remember, you’re matching patterns. So

SELECT p.*
FROM products as p;

becomes

MATCH (p:Product)
RETURN p;

As I mentioned earlier, WHERE clauses are similar in both languages. However, you can use some shortcuts in Cypher that are not available in SQL. For example,

MATCH (p:Product)
WHERE p.productName = "Chocolade"
RETURN p.productName, p.unitPrice;

can also be expressed as

MATCH (p:Product {productName:"Chocolade"})
RETURN p.productName, p.unitPrice;

There are some minor differences in WHERE clauses between SQL and Cypher. For example, the LIKE expression using % as a wildcard in SQL becomes STARTS WITH, CONTAINS, or ENDS WITH in Cypher. You can also use regular expressions to the same end in Cypher, for example p.productName =~ "C.*".
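For readers who think in a general-purpose language, the three string predicates and the regular-expression form correspond roughly to the Python operations below. The product names are illustrative, and I am assuming that the =~ operator must match the whole string (as Java's regex matching does), which is why the sketch uses `re.fullmatch` rather than `re.search`.

```python
import re

names = ["Chocolade", "Chai", "Chang", "Ipoh Coffee"]

# SQL: LIKE 'Ch%'    Cypher: STARTS WITH "Ch"
starts_with = [n for n in names if n.startswith("Ch")]

# SQL: LIKE '%ang%'  Cypher: CONTAINS "ang"
contains = [n for n in names if "ang" in n]

# SQL: LIKE '%e'     Cypher: ENDS WITH "e"
ends_with = [n for n in names if n.endswith("e")]

# Cypher: =~ "C.*" (assumed to require a match of the entire string)
regex = [n for n in names if re.fullmatch(r"C.*", n)]
```

Note that "Ipoh Coffee" is excluded by the regular expression even though it contains a capital C, because the pattern has to match from the first character.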

Like SQL, Cypher supports inner and outer joins, but the notation becomes MATCH with a relationship pattern for inner joins, and OPTIONAL MATCH for outer joins. Much of the scut work needed to define n:m joins with intermediate join tables in relational schemas goes away in Cypher, because the graph schema is explicit about relationships. Aggregates are simpler, too. For example, the SQL query to find the top-selling employees

SELECT e.EmployeeID, count(*) AS Count
FROM Employees AS e
JOIN Orders AS o ON (o.EmployeeID = e.EmployeeID)
GROUP BY e.EmployeeID
ORDER BY Count DESC LIMIT 10;

becomes

MATCH (:Order)<-[:SOLD]-(e:Employee)
RETURN e.name, count(*) AS cnt
ORDER BY cnt DESC LIMIT 10

in Cypher. The GROUP BY clause is not needed in Cypher: any non-aggregated expression in the RETURN clause (here, e.name) implicitly becomes the grouping key for the count aggregate. The JOIN clause on EmployeeID values isn’t needed in this particular case because the SOLD relationship pattern has already captured its intention.
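The two styles of aggregation can be compared side by side in Python. The first half runs a join with an explicit GROUP BY against an in-memory SQLite database; the second half counts over a list of stored (employee, order) pairs, where the grouping key is simply whatever is not being counted. The table names, column names, and rows are a small hypothetical stand-in for Northwind, and the second half is an analogy for the Cypher query, not an execution of it.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, LastName TEXT);
CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    EmployeeID INTEGER REFERENCES Employees(EmployeeID)
);
INSERT INTO Employees VALUES (1, 'Davolio'), (2, 'Fuller'), (3, 'Leverling');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 1), (13, 2), (14, 3), (15, 3);
""")

# Relational style: join the tables, then group explicitly.
sql_rows = conn.execute("""
    SELECT e.EmployeeID, count(*) AS OrderCount
    FROM Employees AS e
    JOIN Orders AS o ON o.EmployeeID = e.EmployeeID
    GROUP BY e.EmployeeID
    ORDER BY OrderCount DESC LIMIT 10
""").fetchall()
conn.close()

# Graph style: the SOLD relationships already pair employees with orders,
# so counting per employee needs neither a join nor a GROUP BY clause.
sold = [(1, 10), (1, 11), (1, 12), (2, 13), (3, 14), (3, 15)]
graph_rows = Counter(employee for employee, _order in sold).most_common(10)
```

Both `sql_rows` and `graph_rows` list employee 1 with three orders, employee 3 with two, and employee 2 with one.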

All this takes some getting used to, but you’ll probably find that you like it. In programming, simpler is almost always better.

Neo4j graph analytics and graph algorithms

Graph analytics and graph algorithms help you to understand the organization and dynamics of complex systems. These can be applied globally to discover the overall nature of networks and model the behavior of intricate systems, and locally — possibly in real time — to provide a focused view of relationships between specific data points, as shown in the figure below.

Neo4j provides five path-finding and traversal algorithms including parallel depth-first and breadth-first searches, four centrality algorithms including PageRank, and six clustering algorithms including Louvain Modularity. Louvain Modularity is often used for fraud ring detection.
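To give a feel for what a centrality algorithm computes, here is a small, single-threaded PageRank written as a plain power iteration in Python. It is a textbook-style sketch under my own assumptions (a damping factor of 0.85, a fixed iteration count, and a made-up four-node graph), not the parallel implementation that ships in the Neo4j graph algorithms library.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [nodes it links to]}.

    Every node must appear as a key, even if it has no outgoing links.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# "c" is linked to by every other node; "b" and "d" have no incoming links.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
most_central = max(ranks, key=ranks.get)
```

In this toy graph the node that everyone links to ends up with the highest score, and the two nodes with no incoming links tie for the lowest.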

[Figure: Neo4j graph of connections from Wilbur Ross, Jr. to Anthony Grant Blumberg. Credit: IDG]

This figure shows a graph algorithm (allShortestPaths) in action on the Paradise Papers data in Neo4j Desktop. Here we show the shortest connections between Wilbur Ross, Jr., the U.S. Secretary of Commerce, and Anthony Blumberg, then CEO of ConvergEx Group LLC, a broker-dealer. All paths between them go through Appleby Trust, an offshore legal services provider operating in Bermuda, the Cayman Islands, and other tax havens. Blumberg was cited by the SEC for illegal practices.
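The idea behind an all-shortest-paths search can be sketched with a breadth-first search that remembers every predecessor lying on a minimum-length route. This Python version is my own illustration for an unweighted graph, not the implementation behind Cypher's allShortestPaths, and the node names are placeholders rather than entities from the Paradise Papers data.

```python
from collections import deque


def all_shortest_paths(graph, source, target):
    """Return every minimum-length path from source to target.

    graph maps each node to a list of its neighbors (unweighted).
    """
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                # First visit: record the distance and the first predecessor.
                dist[neighbor] = dist[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                # Another equally short way in: keep this predecessor too.
                parents[neighbor].append(node)
    if target not in dist:
        return []

    def unwind(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in unwind(p)]

    return unwind(target)


graph = {
    "Person A": ["Entity 1", "Entity 2"],
    "Entity 1": ["Person A", "Intermediary"],
    "Entity 2": ["Person A", "Intermediary"],
    "Intermediary": ["Entity 1", "Entity 2", "Person B"],
    "Person B": ["Intermediary"],
}
paths = all_shortest_paths(graph, "Person A", "Person B")
```

Here both shortest paths have three hops and both pass through the same intermediary node, which is the kind of structure the figure brings out.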

At a Glance

Neo4j is vastly more efficient than SQL or NoSQL databases for tasks that look at networks of related items, but the graph model and Cypher query language will require learning.

Pros

• Native graph storage and native graph engine
• Supports ACID properties
• Has cluster support and runtime failover
• Better performance than relational databases for “graph-y” applications
• Open source version available

Cons

• Cypher query language is not exactly SQL and takes some learning
• Graph database design is different from relational database design
• Open source engine does not support clustering