A graph database, put simply, is a database that treats the “relationships” between data as being as important as the data itself. A graph database is designed to hold data without limiting it to a pre-established model. The data in such a database shows how each individual entity is connected with or related to others. A graph database “natively” embraces relationships, while other databases compute relationships at query time using JOIN operations. A graph database stores the connections alongside the data in the model. In their early years, graph databases were “generally” regarded as a type of NoSQL or non-relational database, a category created to address the limitations of relational databases, but they have matured past such delineations and are now considered their own type of innovative database technology.
Graph databases store data entities as nodes, with graph edges that store each entity’s relationships. An edge always has a start node, an end node, a type, and a direction. An edge can describe a variety of relationships, such as ownership, actions, and sibling relationships. Edges are similar to the links in 1970s network databases, but network-model databases worked at lower levels of abstraction and lacked an easy way to traverse a chain of edges. There are no limits to the variety of relationships nodes can express.
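The edge structure described above can be sketched in a few lines of Python. This is only an illustration, not how any particular graph database stores its data; the class names, the `rel_type` field, and the sample nodes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """A directed, typed relationship from `start` to `end` (illustrative names)."""
    start: str
    end: str
    rel_type: str


@dataclass
class Graph:
    # Edges are indexed by both endpoints, so direction is preserved
    # and a node's relationships can be read without searching.
    outgoing: dict = field(default_factory=dict)
    incoming: dict = field(default_factory=dict)

    def add_edge(self, start: str, end: str, rel_type: str) -> Edge:
        edge = Edge(start, end, rel_type)
        self.outgoing.setdefault(start, []).append(edge)
        self.incoming.setdefault(end, []).append(edge)
        return edge


graph = Graph()
graph.add_edge("alice", "car_42", "OWNS")        # ownership
graph.add_edge("alice", "bob", "SIBLING_OF")     # sibling relationship
graph.add_edge("bob", "car_42", "DROVE")         # action

# Relationships that start at alice, and relationships that end at car_42.
alice_out = [(e.rel_type, e.end) for e in graph.outgoing["alice"]]
car_in = [(e.start, e.rel_type) for e in graph.incoming["car_42"]]
print(alice_out)
print(car_in)
```

Because each edge is a plain record with its own type, a new kind of relationship can be added without changing any schema.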
In a graph database, “traversing” the graph’s connections, or relationships, is remarkably fast. This is because the relationships are stored with the data rather than calculated separately with each query. Graph databases are especially useful when working with recommendation engines, social networking, and fraud detection.
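A toy comparison can make the difference concrete. The sketch below answers the same two-hop question in two ways: once by rescanning a flat table of edge rows on every hop, and once by following edges stored with each node. The data and function names are hypothetical, and the row counts are not a benchmark, since real relational engines use indexes rather than full scans.

```python
from collections import defaultdict

# Hypothetical "follows" relationships as (start, end) rows in a flat edge table.
edge_rows = [
    ("ann", "ben"), ("ben", "cat"), ("cat", "dan"),
    ("ann", "eve"), ("eve", "dan"), ("dan", "fay"),
]


def reachable_by_scanning(rows, start, hops):
    """Join-style approach: every hop rescans the whole edge table."""
    frontier, seen, rows_read = {start}, {start}, 0
    for _ in range(hops):
        next_frontier = set()
        for src, dst in rows:
            rows_read += 1
            if src in frontier and dst not in seen:
                next_frontier.add(dst)
        seen |= next_frontier
        frontier = next_frontier
    return seen - {start}, rows_read


def reachable_by_adjacency(adjacency, start, hops):
    """Graph-style approach: each node already holds its outgoing edges."""
    frontier, seen, edges_read = {start}, {start}, 0
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for dst in adjacency.get(node, ()):
                edges_read += 1
                if dst not in seen:
                    next_frontier.add(dst)
        seen |= next_frontier
        frontier = next_frontier
    return seen - {start}, edges_read


# Store the relationships with the data: one adjacency list per node.
adjacency = defaultdict(list)
for src, dst in edge_rows:
    adjacency[src].append(dst)

scan_result, rows_read = reachable_by_scanning(edge_rows, "ann", hops=2)
walk_result, edges_read = reachable_by_adjacency(adjacency, "ann", hops=2)
print(sorted(scan_result), rows_read)
print(sorted(walk_result), edges_read)
```

Both functions return the same set of people, but the scanning version reads the full table once per hop while the adjacency version only reads edges that leave the current frontier.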
Gaurav Deshpande is Vice President of Marketing at TigerGraph. In a recent interview with DATAVERSITY®, Deshpande remarked that:
“Whenever customers ask me about [graph databases], I keep it very simple. When you hear the word ‘graph,’ graph is equal to relationship. So, any time you are trying to do analysis of relationships, that’s where you should use the graph database. And given that all of us are increasingly more connected to each other – both as people and as organizations, as entities – it just makes sense that graph databases would become more prominent and more important as time goes by.”
According to Deshpande, graph databases allow organizations to do things that they cannot possibly do with relational databases or with SQL. That is simply due to the number of database joins that would have to be processed.
“When you’re trying to find complex relationships between products, locations, suppliers, or even in healthcare between doctors, patients, the claims that have been filed, different policies that are part of each of the patient profiles, a graph database is the way to go.”
Graph databases can provide sophisticated fraud prevention. With graphs, relationships can be used to screen financial transactions in near-real time. Fast graph queries can detect when a potential customer is using a credit card, or is linked to an email address, that was previously involved in a fraudulent transaction. Graph databases also detect relationship patterns. For instance, a graph database can detect people associated with a notorious or suspected email address, or a variety of people using the same IP address while residing at different physical addresses. On the issue of fraud, Deshpande commented:
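The two patterns mentioned above (accounts linked to a suspect email address, and several accounts sharing an IP address while listing different physical addresses) can be sketched as one-hop and two-hop lookups on a small in-memory graph. Everything here is made up for illustration: the account records, the flagged email, and the function names are assumptions, and a production system would run such queries inside the graph database itself.

```python
from collections import defaultdict

# Hypothetical account records; all identifiers and values are invented.
accounts = [
    {"id": "acct_1", "email": "a@example.com", "ip": "10.0.0.5", "address": "1 Elm St"},
    {"id": "acct_2", "email": "b@example.com", "ip": "10.0.0.5", "address": "9 Oak Ave"},
    {"id": "acct_3", "email": "c@example.com", "ip": "10.0.0.7", "address": "4 Pine Rd"},
    {"id": "acct_4", "email": "a@example.com", "ip": "10.0.0.9", "address": "7 Birch Ln"},
]
flagged_emails = {"a@example.com"}  # assumed to come from earlier fraud cases

# Build an undirected graph: each account node links to the attribute
# nodes it uses, and each attribute node links back to its accounts.
neighbors = defaultdict(set)
for acct in accounts:
    for kind in ("email", "ip", "address"):
        attr_node = (kind, acct[kind])
        neighbors[acct["id"]].add(attr_node)
        neighbors[attr_node].add(acct["id"])


def accounts_linked_to_flagged_email(flagged):
    """One hop from each flagged email node to the accounts that use it."""
    linked = set()
    for email in flagged:
        linked |= neighbors.get(("email", email), set())
    return sorted(linked)


def shared_ips_with_different_addresses():
    """IP nodes used by several accounts whose physical addresses differ."""
    suspicious = {}
    for node, linked in neighbors.items():
        if isinstance(node, tuple) and node[0] == "ip" and len(linked) > 1:
            addresses = {
                value
                for acct_id in linked
                for kind, value in neighbors[acct_id]
                if kind == "address"
            }
            if len(addresses) > 1:
                suspicious[node[1]] = sorted(linked)
    return suspicious


print(accounts_linked_to_flagged_email(flagged_emails))
print(shared_ips_with_different_addresses())
```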
“China Mobile is an interesting use case. It’s the largest mobile service provider in the world. We serve over 900 million users [of China Mobile] doing real-time fraud detection with them. Every time a call comes in, based on the calling pattern alone, we can deduce whether a particular caller is a fraudster, or likely to be a fraudster. And then, if it’s likely to be a fraudster, we warn the person issuing the call.”
They [China Mobile] can reliably say that “there’s a high probability that this is a fraud call,” when it crosses a particular threshold. They then send the alert out to the call center and tell the customer that “you need to complete your Know Your Customer (KYC) information. We believe that this account is being used for fraudulent usage.”
Data virtualization can be useful in graph databases. Essentially, data virtualization can make graphs easier to work with. Graph databases can be difficult because they often use different query languages, requiring users to learn a new query language focused on navigating the data links instead of querying structured tables. Virtualization sidesteps this issue by creating a virtual computer, allowing the use of a more familiar environment. Additionally, a virtual computer will support the use of familiar business insight tools.
Accessing data stored in a graph database typically requires a query language that is not SQL. SQL was designed for data manipulation in a relational system and does not support syntax to traverse the graph, said Deshpande. While there is a wide variety of graph query languages, most are tied to a product, and there is no universal graph query language as yet, although TigerGraph, Neo4j, and other graph database companies are working together to define a standard graph query language. Access to some graph databases is provided by REST application programming interfaces (APIs). Deshpande shared a story on virtualization, saying:
“One of Visa’s problems is they have a large infrastructure of servers, computer servers, storage areas, and they use virtualization a lot, like every data center operator out there. Virtualization is great, because it allows you to share resources, but it also creates interesting dependencies.”
If a particular network node goes down, one particular server goes down, one particular storage area goes down – what is the impact of that on the workload running on the affected component of network or IT infrastructure? Is it something that’s processing live payments coming in from the United States, from Europe, from Asia? Or is it something that’s simply handling a routine reporting workload that can easily be offloaded to a different server?
“So that impact analysis of something going down is what they manage in real-time with TigerGraph, and it’s a real-time Internet of Things use case, because you have IoT sensors attached to different computer servers, different storage arrays, which are used to track the health of each component and inform when it goes down. With TigerGraph, Visa can understand the impact of a component going down and devise next steps to minimize the disruption.”
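The kind of impact analysis described here is, at its core, a walk over a dependency graph. The sketch below is a minimal illustration under invented assumptions: the component names, the `depends_on_me` mapping, and the set of critical workloads are hypothetical, and the code does not represent how Visa or TigerGraph implement this.

```python
from collections import deque

# Hypothetical dependency edges: component -> things that run on or depend on it.
depends_on_me = {
    "storage_array_1": ["server_a", "server_b"],
    "server_a": ["vm_payments_us"],
    "server_b": ["vm_reporting"],
    "vm_payments_us": ["workload_live_payments"],
    "vm_reporting": ["workload_monthly_reports"],
}
critical_workloads = {"workload_live_payments"}  # assumed classification


def impacted_by(failed_component):
    """Breadth-first walk over everything downstream of the failed component."""
    impacted = []
    seen = {failed_component}
    queue = deque([failed_component])
    while queue:
        node = queue.popleft()
        for dependent in depends_on_me.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


def is_critical_outage(failed_component):
    """True if any downstream workload is marked critical."""
    return any(item in critical_workloads for item in impacted_by(failed_component))


print(impacted_by("server_b"), is_critical_outage("server_b"))
print(impacted_by("storage_array_1"), is_critical_outage("storage_array_1"))
```

In this toy setup, losing `server_b` only touches a reporting workload that could be moved elsewhere, while losing `storage_array_1` reaches the live payments workload.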
Graph Database Innovations
Different graph databases also use different methods of storage. Some use a relational database, storing the graph data in multiple tables. Other systems use key-value storage or a document-oriented database, putting graph databases in the category of NoSQL structures. Many graph databases using non-relational storage engines add tags or properties, providing the relationships with a pointer to another document and categorizing data for easy retrieval en masse. Deshpande noted:
“Key-value databases are great at storing documents and simple relationships. But if you’re trying to go deep into multiple relationships, beyond the second or third level of a relationship, you’re scanning the same large table that’s made up of billions of records, and it doesn’t scale.”
Other options have native graph storage, meaning the business entities and their relationships are modeled natively inside the graph database, “but did not have parallel loading or any kind of the parallel computation functionality to update or analyze the data or scale up for the increasing data volume,” he said.
Having a NoSQL database at the backend can work to a certain extent, “because it can allow horizontal scaling (scale out) of the solution,” he noted. But there are also problems with that, said Deshpande: because it is not built on native graph storage of data, “it’s very difficult to scale for deep relationship analysis which goes beyond 2 to 3 levels or hops into the data.”
As he closed the interview, he discussed these options in terms of “graph generations.” The first generation got the storage right (native graph) but did not have parallel processing for loading or for analysis. The second got the scale-out right but did not have the right storage for it. Finally, he said, the third-generation graph database is TigerGraph, a native parallel graph database:
“We store it as a native graph and then we scale it up and out using parallel computing and database sharding. So, one big difference is that you can take a half a terabyte of data or one terabyte of data graph, and you can distribute that graph into multiple servers of 100 gigabytes each being able to run real-time queries that access and combine data from all five servers.”
They can run queries across them, look at data from one server, one node, go to the next node in the cluster, and find the data from there and combine the two to bring back the results. “That’s a true distributed graph that’s data sharding. That’s something that’s not supported by any other vendor in the marketplace.” That’s the third generation of graph databases.
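The idea of a traversal that hops between servers can be sketched with a toy partitioning scheme. This is a simplified illustration only and makes no claim about how TigerGraph shards or queries data: the edges, the three-shard setup, and the `shard_of` rule are assumptions chosen so the example is deterministic.

```python
# Hypothetical edges of a small graph.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "f")]
NUM_SHARDS = 3


def shard_of(node: str) -> int:
    """Assign a node to a shard; a deterministic toy rule standing in for a real partitioner."""
    return sum(node.encode()) % NUM_SHARDS


# Each shard stores only the outgoing edges of the nodes it owns.
shards = [dict() for _ in range(NUM_SHARDS)]
for src, dst in edges:
    shards[shard_of(src)].setdefault(src, []).append(dst)


def traverse(start: str, hops: int):
    """Follow edges hop by hop, asking whichever shard owns each frontier node."""
    frontier, seen, shards_touched = {start}, {start}, set()
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            owner = shard_of(node)
            shards_touched.add(owner)
            for dst in shards[owner].get(node, []):
                if dst not in seen:
                    next_frontier.add(dst)
        seen |= next_frontier
        frontier = next_frontier
    # The combined result draws on data held by every shard that was visited.
    return seen - {start}, shards_touched


reached, touched = traverse("a", hops=3)
print(sorted(reached), sorted(touched))
```

A real distributed graph database would also have to handle network calls, parallel execution, and balanced partitioning, none of which this sketch attempts.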