
Graph Databases Go Mainstream



Every decade seems to have its database. During the 1990s, the relational database became the principal data environment, its ease of use and tabular arrangement making it a natural fit for the growing data needs of the web. While relational databases remained strong, the 2000s saw the emergence of XML databases, and NoSQL, the idea that databases didn't need to be structured in a purely tabular form, began to get hot. During the 2010s, JSON databases gained traction, along with the spectacular rise and ultimate fall of Hadoop as a data platform. Given that history, there is compelling evidence that graph databases will be the go-to database of the 2020s.

Graph databases have been around in one form or another since the early oughts, but they were generally slower, more complex to work with, and more limited in applicability than relational databases. They were also seen primarily as academic databases, as one of the earliest use cases for them was building logical analysis systems, and because of those academic associations, the technology stayed well under the commercial radar for a significant number of years.

By mid-2014, however, several key advances had come together to give graph database technology the first of several boosts. Neo4j, an early and still heavily used graph database, had gained enough maturity and penetration to be used for certain classes of mathematical graph processing. Hardware (through cloud computing) had become fast enough that many of the key early performance challenges could be overcome. The SPARQL graph query language reached its second major version (SPARQL 1.1), which solved many of the problems of the earlier release, including the lack of an update capability; with SPARQL Update, a consistent mechanism emerged for adding content dynamically. And the emergence of JSON data stores such as MongoDB and CouchDB had put far more focus on the challenges of indexing non-traditional databases, which in turn led to significant optimization in handling joins, a core requirement of any database but one that is especially important for graph databases.
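
To make that update capability concrete, here is a minimal sketch using Python's rdflib library (a common open source toolkit for semantic graphs); the example.org namespace and the alice/bob resources are purely illustrative.

# A minimal sketch of SPARQL 1.1 Update via Python's rdflib library.
# The namespace and resource names are illustrative.
from rdflib import Graph

g = Graph()

# INSERT DATA adds triples through the same query interface used for reads,
# which is the "consistent mechanism for adding content dynamically"
# described above.
g.update("""
    PREFIX ex: <http://example.org/>
    INSERT DATA { ex:alice ex:knows ex:bob . }
""")

# Read the data back with an ordinary SPARQL SELECT query.
for subj, obj in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?s ?o WHERE { ?s ex:knows ?o . }
"""):
    print(subj, obj)

The same update syntax works against a remote triple store as well, which is what makes it a consistent mechanism rather than a vendor-specific one.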

Just as important, several companies began experimenting with graph databases to solve problems that were becoming vexing at the corporate level - management of enterprise metadata, master data management, natural language processing, knowledge navigation and other related problems where search technology by itself had bottomed out. Ironically, machine learning, which typically uses a data-clustering approach to text analysis as a brute-force alternative, has increasingly become just another mechanism for building graph databases, to the extent that the most recent graph databases now incorporate machine-learning algorithms and tools as part of their core suites.


Understanding Graphs

Finally, graph databases are increasingly taking advantage of a completely unexpected development - the growing power and sophistication of graphics processing units, or GPUs, the specialized multi-core processors dedicated to creating sophisticated three-dimensional graphics. GPUs work by creating meshes broken down into triangles - millions or even billions of them. By assigning properties to the points (nodes) and lines (edges) that make up these meshes, GPUs can calculate color, lighting, hardness and softness of edges, displacement and dozens of other properties, and can do so very, very quickly because of the pipeline architecture they employ.

It turns out that a graph database is really not that different from the meshes used in graphics processing (in fact, they are nearly identical), especially from the standpoint of assigning properties to nodes and edges. Each node represents a concept; each edge represents a relationship. This means that, especially towards the end of the 2010s, graph database vendors have increasingly turned to GPUs to traverse and compare node values along these graphs, exploiting the massively parallel capabilities such GPUs offer. While more traditional CPUs will likely continue to be used as well, graph databases on GPUs are likely to be the paradigm that truly takes hold in the 2020s.

Currently, graph databases are divided into two broad categories: property graphs and semantic graphs. They differ in one critical respect: a semantic graph treats the edge as a globally named object that identifies a relationship, while a property graph treats the edge as something unique to the surrounding nodes. This means that it is possible to assign literal properties (such as dates or strings) to a relationship in a property graph. A similar facility, called reification, exists in semantic graphs, in which the two nodes and the edge together are given an identifier, but SPARQL does not yet intrinsically support this capability.
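
To illustrate the difference, here is a small sketch of the reification workaround in Python's rdflib: the two nodes and the edge are bundled into a statement node, which can then carry a literal property (a date) much as a property graph edge would. All of the names are illustrative.

# Sketch of RDF reification: the alice-knows-bob relationship gets its own
# identifier so that a literal (a date) can be attached to it, roughly what
# a property graph does natively on its edges. Names are illustrative.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()

# The base relationship itself.
g.add((EX.alice, EX.knows, EX.bob))

# Reify it: the statement becomes a node that can be described like any other.
stmt = EX.alice_knows_bob
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.alice))
g.add((stmt, RDF.predicate, EX.knows))
g.add((stmt, RDF.object, EX.bob))
g.add((stmt, EX.since, Literal("2018-04-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))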

It is very likely that this will change with the next version of SPARQL, which will probably formally emerge in the early 2020s, along with some changes regarding variable predicate paths. What that means in practice is that property and semantic graph databases will merge into a single category within the next few years. In addition, you're seeing the merger of graph databases with NoSQL databases (in either JSON or XML flavor, or in some cases both).

Graph databases are inherently more flexible than traditional relational database systems because the metadata about the database can be treated as data itself, accessible in exactly the same way. You can, in fact, easily represent a relational database using graphs, because relational table/column/row/key structures are themselves a form of graph. Until comparatively recently, this flexibility came at the expense of performance - but with the shift to GPUs, that performance gap has largely disappeared.
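
As an illustration of that point, here is how a single relational row might be expressed as triples in Python's rdflib; the table, columns and URIs are invented for the example (the W3C's R2RML standard defines a more formal mapping).

# Sketch: one relational row, players(id=42, name='Felix Hernandez',
# team_id='mariners'), expressed as graph triples. URIs are illustrative.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()

player = EX["player/42"]                               # primary key -> node
g.add((player, EX.name, Literal("Felix Hernandez")))   # column -> predicate + literal
g.add((player, EX.team, EX["team/mariners"]))          # foreign key -> edge to another node

# The schema is just more triples, so the metadata is queryable as data.
g.add((EX.name, EX.isColumnOf, EX["table/players"]))
g.add((EX.team, EX.isColumnOf, EX["table/players"]))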

More and more systems are also providing relational shims that hide the differences between tabular and graph databases so that they can work with existing toolsets (such as those used for data analytics or visualization). As full graph features become standardized, new tools will emerge that exploit both the flexibility and the performance of the graph approach, meaning that graph databases have the potential to replace the existing relational market by 2030.


State of the Market

At one point, about ten years ago, there were maybe four commercial and a couple of open source graph databases, all of them collectively serving a market that was microscopic in comparison to the relational market. I can personally remember walking the trade show floor at a couple of semantics conferences around then where the hopeful vendors outnumbered the participants. That's no longer the case: both the number of vendors and the number of graph database experts have mushroomed in the last few years, to the extent that it's worth discussing the different groups of vendors.


The Stalwarts

This group includes the vendors that survived the nuclear winter in semantics from 2008 to 2012. They tend to have the longest track records, but also tend to focus on functionality, such as inferencing and SPIN, that for the most part predated the second release of SPARQL, even though they fully support that version. These include products such as TopQuadrant's TopBraid, Cambridge Semantics' Anzo, Ontotext's GraphDB, and Franz's AllegroGraph, among others. All of these are good products, and TopBraid in particular continues to remain innovative in this space with its work on SHACL (the Shapes Constraint Language, a schema and validation language for graph data). They also tend to have broader data governance strategies that make them good products for implementing traditional semantics at an enterprise level.

The Hybrids

This group contains the vendors that combine triple stores (graph databases) with other integrated database systems, such as JSON or XML stores, or with search engines. In many respects, these are transitional systems that repurposed (or were built on top of) existing indexing technology, and for the most part they moved into the graph space between roughly 2012 and 2017. These include MarkLogic, which provides support for JSON and XML documents alongside semantics, and Stardog, which is built on a JSON foundation but incorporates a sophisticated indexing technology. Hybrid systems work well when you're in a mixed data regime, though they also tend to be priced higher than other graph databases.

The Young Turks

This group has in general emerged as a force only in the last couple of years, and includes players such as PoolParty, Blazegraph (and its latest incarnation as Amazon Neptune), TigerGraph and data.world. I'd also put Neo4j in this category, despite its having been around for a while. In most cases these products treat graph databases as property graph databases, and in some cases they have jettisoned features, such as inferencing, that have become increasingly irrelevant as SPARQL achieves broader adoption.

The Supporters

These are companies, such as Smartlogic (or the recently acquired Capsenta), that are typically built on top of (and work with) multiple triple stores. They generally provide support for data modeling, document entity enrichment, high-speed ingestion and integrated visualization. There are clear signs that as the market becomes more crowded, the Supporters will likely either establish strategic partnerships with one particular vendor or merge outright (my suspicion is that MarkLogic and Smartlogic, in particular, are trending in that direction).



Knowledge Bases as Biggest Growth Sector

There are a number of factors that all point to graph databases gaining considerable traction in the next decade. One of the first is the rise of the knowledge base, which can be thought of as an encyclopedia for data.

Consider, for instance, a fantasy baseball league. It typically has a number of different classes of things to keep track of: players, positions, teams, seasons. As the number of classes increases, the ability to create interfaces to view the various items in each class drops, and the complexity of those apps grows. Traditional database applications get overwhelmed at that point because they are generally attempting to provide a specific view of only one part of the data space involved.

A knowledge base takes a different approach, one often referred to as a context-free design, in which every object in the system is considered to be a node in a graph. In a knowledge base, each node has a minimal set of properties that allow it to be displayed in the graph. I like to use the card analogy here, as in a baseball card: the card in question might describe a pitcher, a catcher, a team, a batter and so forth, but all of them need a title, a picture, and a few other basic properties so that the cards can be printed.

For those familiar with the term, a context-free approach is conceptually similar to the approach the web takes with REST, in that the role of the data being requested or published generally doesn't matter, only that the data and the process follow certain basic protocols. However, the details do matter, and they typically overlap in ways that are difficult to manage with pure hierarchies. For instance, with the baseball league, information is often organized hierarchically such that, in a given season, a team has a roster of players, each of whom has certain roles (pitcher, catcher, shortstop, etc.).

This gets more complicated when you start considering inter-season trades and similar actions, and also because the attributes that are pertinent to a pitcher (pitches per game, number of strikeouts, etc.) are different from those of a position player acting as a batter (batting average, number of walks, etc.). Whether a pitcher even has batting statistics depends on the league, since the National League requires pitchers to bat while the American League uses a designated hitter. Modeling this kind of logic puts significant strain on a typical database, but it is exactly where knowledge bases excel, because properties are themselves resources that are accessible as data, and because properties can have their own metadata.

This ability makes it possible to effectively build queries by navigation, rather than by writing in a query language. Additionally, you can create additional filters (what are usually called constraints in semantics) that allow you to find, say, all players who were pitchers for the Seattle Mariners in the 2018 season, sorted by average number of pitches per game, without needing to understand the database tables or data structures, and typically without even needing to write a line of code.
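
For the curious, here is roughly what such a constraint might compile to behind the scenes, written as a SPARQL query run through Python's rdflib; the baseball vocabulary (ex:player, ex:role, ex:pitchesPerGame and so on) is invented for illustration rather than drawn from any real league schema.

# What the "Mariners pitchers in 2018, sorted by pitches per game" filter
# might look like as generated SPARQL. The vocabulary is illustrative.
from rdflib import Graph

g = Graph()
# ...assume g has been loaded with the league's knowledge base...

PITCHERS_2018 = """
PREFIX ex: <http://example.org/baseball/>
SELECT ?player ?avgPitches WHERE {
    ?stint  ex:player          ?player ;
            ex:team            ex:SeattleMariners ;
            ex:season          ex:Season2018 ;
            ex:role            ex:Pitcher ;
            ex:pitchesPerGame  ?avgPitches .
}
ORDER BY DESC(?avgPitches)
"""

for row in g.query(PITCHERS_2018):
    print(row.player, row.avgPitches)

The point is not that an end user writes this query, but that a navigation-driven interface can generate it from the constraints the user picks.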

This flexibility also extends to the ability to create data services. For instance, it's possible with a knowledge base to transform the above query directly into a URL that returns a feed of current data in a machine-readable (JSON or XML) format that other applications can use, without a programmer having to write a new data API.
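
To sketch how that might look, the standard SPARQL Protocol already allows a query to be sent as part of a URL and the results to come back as JSON; the endpoint below is hypothetical, and the query string is the one from the previous sketch.

# Sketch of the "query as a URL" idea using the standard SPARQL Protocol.
# The endpoint URL is hypothetical; PITCHERS_2018 is the query string from
# the previous sketch.
import requests

ENDPOINT = "https://example.org/league/sparql"

response = requests.get(
    ENDPOINT,
    params={"query": PITCHERS_2018},
    headers={"Accept": "application/sparql-results+json"},
)
feed = response.json()
for binding in feed["results"]["bindings"]:
    print(binding["player"]["value"], binding["avgPitches"]["value"])

Because the whole request is just a URL plus a query parameter, the same feed can be bookmarked, scheduled, or handed to another application without any new server-side code.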

Knowledge bases may go a long way towards consolidating enterprise data while letting organizations continue to build out both in-house and public-facing applications. Newer technologies are also making it possible to provide property-level security, where a given user will only be able to see or navigate across specific properties if they have the relevant access level.

Machine learning systems are also making it easier to do auto-categorization for knowledge base ingestion, and this will only become more prevalent over time as the benefits of combining machine learning and semantics become better proven. Other applications, including inferential analysis systems, risk detection and mitigation, shortest-path analysis and the like, all of which have semantic components, will likely piggyback on top of the success of knowledge bases. They require a somewhat different organization of data, but as data federation also becomes more prevalent, this is an evolutionary, rather than a revolutionary, change.

One final point to note: one of the more exciting arenas in semantics of late is the potential for unification with property graph technology, where assertions (statements of truth) are treated as resources in their own right. Early work with RDF supported this, but for performance reasons the SPARQL query language specification, used by most extant knowledge graphs, implemented neither it nor variable property paths.

I am anticipating that a SPARQL 2.0 recommendation from the W3C (the standards body that maintains the core semantic web specifications) may be on the horizon, but it likely won't become a reality until 2022 at the earliest. If it does, the result will likely be some form of synthesis of SPARQL and Gremlin and/or Apache TinkerPop.


Summary

Given all that, the roadmap for semantics adoption is becoming increasingly clear. Look for semantic data systems to become more oriented around knowledge bases rather than inferential analysis, at least at first, while at the same time incorporating the latest advances in machine learning and auto-categorization.

The big boys in the data space are now watching this sector closely, and I expect to see a wave of consolidation as strategic partnerships lead to outright mergers and the A-tier players begin snapping up viable technology rather than building it in-house. The likelihood of this happening within the next two years is high, especially if an economic downturn creates the potential for distressed sales.

At the same time, expect to see the re-merger of semantic and property databases as standards get hammered out, more than likely with some formal action out of the W3C towards such a reunification.

 
