A challenge for large businesses working with data is how to manage and unify it. Many companies, seeking to better utilize their data, collect massive amounts of information about their customers’ online activity, preferences, and demographics. This can lead to the development of vast data lakes. Without a screening process, a data lake enables “data hoarding.” A badly organized data lake is often pejoratively called a data swamp.
Additionally, businesses are supporting a growing number of applications — both legacy and new — making the data more distributed and disparate, and more difficult to manage. Data integration solutions combine datasets but are not well-suited for changing data requirements. Data unification can provide an answer. Data unification is currently an expensive and ongoing process, used by larger, well-financed organizations with vast amounts of data from multiple sources.
Data unification merges data from a wide variety of sources into a single, consistent whole. The process involves data cleansing, eliminating unnecessary copies, classification, and schema integration, thereby providing one accurate and unified data source. Finding efficient ways of storing, accessing, and processing “disjointed” data is a major issue facing many organizations. Data unification allows merged data to be mined for business intelligence drawn from past business activities or used to build predictive models.
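The cleansing, de-duplication, and schema-integration steps above can be sketched in a few lines of Python. The record shapes, field names, and the email-based merge key are hypothetical illustrations, not any particular product’s behavior:

```python
# Minimal sketch of data unification: normalize two source schemas onto one
# shared schema, cleanse values, and de-duplicate on a shared key.
# All records and field names here are invented for illustration.

def normalize(record, field_map):
    """Map a source-specific schema onto the unified schema and clean values."""
    out = {}
    for src_field, unified_field in field_map.items():
        value = record.get(src_field)
        if isinstance(value, str):
            value = value.strip()          # basic cleansing
            if unified_field == "email":
                value = value.lower()      # canonicalize the merge key
        out[unified_field] = value
    return out

# Two sources describing the same customer under different schemas.
crm_rows = [{"Email": " Ana@Example.com ", "FullName": "Ana Diaz"}]
web_rows = [{"email": "ana@example.com", "name": "Ana Diaz", "visits": 12}]

crm_map = {"Email": "email", "FullName": "name"}
web_map = {"email": "email", "name": "name", "visits": "visits"}

unified = {}
for row, fmap in [(r, crm_map) for r in crm_rows] + [(r, web_map) for r in web_rows]:
    rec = normalize(row, fmap)
    merged = unified.setdefault(rec["email"], {})   # de-duplicate on email
    merged.update({k: v for k, v in rec.items() if v is not None})

print(unified)
```

The two source rows collapse into one unified record keyed by the cleaned email address.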
Evren Sirin is the chief scientist of Stardog and one of its founders. Stardog offers a knowledge graph platform that helps large organizations link and query complex data in a scalable, extensible, and efficient manner. Companies including Morgan Stanley, Boehringer Ingelheim, Bosch, and NASA use Stardog to power applications focused on customer insight, supply chain optimization, and IoT, and to optimize internal processes including systems engineering, knowledge management, and human capital management. According to Sirin:
“You don’t want to just unify the data, you also want high explainability into the analytical insight derived from the data. One of the reasons leading organizations choose to build a knowledge graph with Stardog is because of our approach to data modeling. We support reusable, declarative models where you can encode the mapping between the information, those data silos, and your overall picture in the knowledge graph. This provides complete traceability.”
A data warehouse stores integrated data hierarchically in files or folders. These often wind up being organizational data silos – information under the control of a single department, isolated from other data sources. A data lake stores vast amounts of raw data in its native format. In data lakes, each data element is assigned a specific, unique identifier, along with an extended metadata tag, for purposes of locating it. Data lakes are generally associated with Hadoop-oriented storage, with data mining tools and analytics working with the data on Hadoop’s cluster nodes.
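The identifier-plus-metadata scheme described above can be illustrated with a toy catalog entry. The field names and the example path are invented for this sketch, not any specific lake catalog’s format:

```python
import datetime
import uuid

# Hypothetical data lake catalog entry: each raw element receives a unique
# identifier plus extended metadata so it can be located and assessed later.
def catalog_entry(path, source, fmt):
    return {
        "id": str(uuid.uuid4()),   # unique identifier for lookup
        "path": path,              # location of the raw, native-format data
        "source": source,          # where the data came from
        "format": fmt,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

catalog = [catalog_entry("/lake/raw/clickstream/2024-05-01.json", "web", "json")]
print(catalog[0]["id"], catalog[0]["path"])
```

Tracking provenance fields like these is exactly the bookkeeping that keeps a lake from sliding into a swamp.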
One problem with gathering and storing large amounts of data in one location is avoiding the dreaded data swamp, said Sirin. Knowing what the data is, how accurate it is, where it came from, and how long it will remain relevant are all necessary pieces of information when planning a data lake and avoiding a data swamp scenario. But you don’t need to get rid of your data lake. Knowledge graphs sit on top of data lakes and help you make sense of the data. Sirin commented that:
“A knowledge graph is designed to combine data of all structures and schemas. The data you need doesn’t just live in rows and columns. It also lives in PDFs, enterprise applications like JIRA or Yammer, and many other places storing unstructured data.”
So, it’s necessary to find the information and the data structure, map it to the nodes in the graph, and relate it to all the other relationships that exist. “So, that’s an important part of the unification problem,” he said.
“Unification does not mean all data must be physically located in the same system. Stardog has been a big proponent of virtualizing data for years and is one of the only platforms that lets you both virtualize and materialize data. Virtualization allows you to access data without copying it – you can leave it in place – and this also ensures your data is always up to date. Regardless of your implementation, it’s all powered by the same reusable model.”
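The virtualize-versus-materialize distinction in the quote can be illustrated with a toy contrast. The classes and the in-memory “source” below are assumptions for illustration, not Stardog’s API:

```python
# Toy contrast: materializing data (copying it at load time) versus
# virtualizing it (querying the source on each access, leaving it in place).

class Source:
    """Stands in for a remote system that owns the data."""
    def __init__(self, rows):
        self.rows = rows

class MaterializedView:
    def __init__(self, source):
        self.copy = list(source.rows)   # snapshot taken at load time
    def read(self):
        return self.copy                # may go stale

class VirtualView:
    def __init__(self, source):
        self.source = source            # no copy is made
    def read(self):
        return self.source.rows         # always reflects the live source

src = Source([{"id": 1, "status": "open"}])
mat, virt = MaterializedView(src), VirtualView(src)

src.rows.append({"id": 2, "status": "closed"})  # source changes later

print(len(mat.read()), len(virt.read()))  # materialized copy is stale; virtual view is current
```

The materialized copy still holds one row after the source grows, while the virtual view sees both, which is the “always up to date” property the quote describes.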
Data extraction is the process of collecting data from various sources. Frequently, companies extract data with the intention of processing it further or sending it to a repository, such as a data warehouse or a data lake. A unified system would normally transform the data as part of this process.
The extraction process for unstructured data typically uses natural language processing (NLP), text mining, and machine learning to identify relevant concepts within the data. In a knowledge graph, these concepts and their various interrelationships are stored in the graph, together with a connection to the original document and its metadata. This data can then be analyzed and reasoned over. On this topic, Sirin commented:
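A heavily simplified version of this extraction step might look like the sketch below. It substitutes a hand-written concept list for trained NLP models, and the document and vocabulary are invented; the point is only the shape of the output: graph edges that link concepts back to their source document.

```python
# Toy concept extraction: find known domain concepts in unstructured text
# and store them as graph edges with a link back to the source document.
# A real pipeline would use trained NLP models rather than exact matching.

CONCEPTS = {"credit swap", "loan", "interest rate"}   # assumed domain vocabulary

def extract(doc_id, text, graph):
    """Append (subject, predicate, object) edges for each concept found."""
    lowered = text.lower()
    for concept in CONCEPTS:
        if concept in lowered:
            # Provenance is kept by pointing the edge at the document node.
            graph.append(("doc:" + doc_id, "mentions", concept))
    return graph

graph = []
extract("report-7", "The loan portfolio hedges rate risk via a credit swap.", graph)
print(graph)
```

Each extracted edge names the originating document, which is what makes the concepts queryable alongside the rest of the graph.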
“We have an NLP pipeline built into the system that comes with NLP models trained on public sources, like Wikipedia, for example. But it’s not a magic bullet in the sense that NLP is very domain specific and a domain like the financial sector will have very specific terms about loans, credit swaps, etc. You can get started with connecting the contents of text documents to your knowledge graph using the general-purpose models we provide and fine-tune the models according to your domain over time.”
It’s possible to plug these models into the pipeline so that documents come in once, go through specialized processing, and the output flows into the knowledge graph, linked to the other parts. “Tying these specs to some of the other things – machine learning, analytics – is an important part of making sense of your data,” he said.
Knowledge Graphs and Data Unification
A Stardog white paper suggests knowledge graphs work well in unifying data and in showing relationships. A knowledge graph connects all data without moving or copying it. A knowledge graph seamlessly layers atop the existing data infrastructure, and reveals the relationships within the data, regardless of its source or format. These graphs are highly scalable, and retain each analysis, which acts as a reusable asset.
Stardog’s knowledge graph includes a highly performant graph database. Essentially, a graph database (GDB) is designed to assign the same importance to the relationships between data as to the data itself. The data is stored with its interconnections, showing how each data point is connected or related to the others. While other databases create connections at query time (by way of expensive JOIN operations), graph databases store the relationships alongside the data. A knowledge graph is an enhanced graph database, enriched with business rules that allow inference to be performed on the connected data. Stardog’s application of knowledge graphs to data unification is novel; capabilities including virtualization and NLP allow more of an organization’s data to be unified for more flexible use. On the subject of knowledge graphs, Sirin said:
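The “graph database plus rules” idea can be sketched with a toy triple store and a single rule. The facts and the “partOf is transitive” rule below are illustrative assumptions, not Stardog’s rule syntax:

```python
# Minimal sketch: relationships are stored as first-class triples, and a
# declarative rule (here, transitivity of "partOf") lets new facts be
# inferred from stored ones rather than computed via joins at query time.

facts = {
    ("sensor-42", "partOf", "assembly-A"),
    ("assembly-A", "partOf", "plant-1"),
}

def infer_transitive(facts, predicate):
    """Apply a transitivity rule until no new triples appear (fixpoint)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            for (c, p2, d) in list(inferred):
                if p1 == p2 == predicate and b == c and (a, predicate, d) not in inferred:
                    inferred.add((a, predicate, d))
                    changed = True
    return inferred

closed = infer_transitive(facts, "partOf")
print(("sensor-42", "partOf", "plant-1") in closed)  # prints True: inferred, not stored
```

The triple connecting the sensor to the plant was never asserted; it falls out of the rule, which is the kind of inference over connected data the paragraph describes.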
“We see more and more the realization that people need a knowledge graph view. Data is an important strategic asset for every industry, particularly in life sciences and manufacturing. Traditional Data Management approaches – like data lakes and warehouses – do not meet the dynamic needs required to solve modern problems, and they take too long. It’s too expensive. It’s not adjusted to changing needs, because every year new requirements come.”
There is certainly a growing interest in knowledge graphs right now. With any technology, a successful implementation requires you to start small and pace yourself. Technology is only one part of the puzzle. Many internal organizational structures actually prevent sharing data cross-functionally. Teams stood up databases for their specific data and the result was hundreds of databases with duplicative and inconsistent data. “We partner with organizations to increase their Data Quality as they implement a knowledge graph. There is no need to wait for 100% Data Quality before you start to make sense of your data. You’ll never achieve that.” The goal is to take steps toward connecting data rather than collecting data. Connections are the key to unlocking new insights – not building bigger databases.