Semantic Hub Architectures: A Complement to Data Lakes

By Jennifer Zaino / November 19, 2015

Delightful metaphors of data lakes, data reservoirs, data flow, and the defeat of data silos paint a picture that data, like water, is entirely fungible: once accommodations are made for data to pass through an enterprise data cleansing system and the pipes are changed so that access to it is ensured, it will stream smoothly and freely for use as needed. How do such metaphors square with one of the newest technological advances in the Data Management field, the Semantic Hub?

Kurt Cagle, Founder and CEO of consultancy Semantical, LLC, made these points at the DATAVERSITY® NoSQL 2015 Conference in August, and then proceeded to tell the audience why he thinks these metaphors are broken. Data is not as clear and clean as water, he explained, but more of a messy combination: a little H2O mixed with mud, tar, gasoline, and other ingredients to form a variable mixture. “Data is essentially chunky with more of the consistency of sewage or sludge than clear, pure, and pristine water,” he remarked.

The problem of data variability of all kinds across systems has its roots in the fact that each time an enterprise stood up a new database, it likely stood up a new data model as well, built around the requirements of that particular project. And little consideration was likely given to the idea that two different database models should use the same identifiers when talking about the same thing. Today, large companies may have thousands or tens of thousands of databases, and “you can almost say that there is a direct one-to-one association between a database and the model that it holds,” he said.
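
To see the problem concretely, here is a minimal, hypothetical sketch (the system names, keys, and field values are invented for illustration) of how two independently modeled systems might describe the same customer:

```python
# Hypothetical records for the same customer, as two separately modeled
# systems might store them. Names, keys, and fields are invented.

crm_record = {
    "cust_id": "C-10482",           # the CRM's own surrogate key
    "name": "Acme Industrial Ltd.",
    "region": "NE",
}

billing_record = {
    "account_no": "0009921-AC",     # the billing system's unrelated key
    "company": "ACME INDUSTRIAL LIMITED",
    "territory": "Northeast",
}

# Nothing in either record says these rows describe the same thing: the
# keys differ, the field names differ, and even the values are formatted
# differently. Every integration project has to rediscover that fact.
```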

That has left things in a state where it’s often a challenge for companies to find what they’re really looking for in their information-dense, mixed-liquid worlds. “Data integration isn’t so much a problem of putting in the pipes to make it happen as it is one of dealing better with what goes through those pipes,” he said. “We have to recognize that we are describing different things from system to system.”

On top of that, data modeling is an art, not a science. And while decades have passed since the birth of programming, that art is still being finessed. “We’re still in the process of figuring out what an object is,” he said. Every time technology scales up to a new level of abstraction – from the reign of independent databases, to databases that facilitate departmental operations, to enterprise databases and beyond – the need arises to consider new ways of thinking about information. “Because of that,” he said, “we are still largely fumbling around for a deep level of understanding of how we describe things.”

Fixes Exist, but so do Gotchas

There are, of course, workarounds to address the issues around modeling and integration. It is unlikely that an Identity Management System was implemented upfront, but such systems come into play once it becomes evident that many databases are talking about more or less the same things. Putting enough Master Data Management (MDM) systems in place will at least make it possible to get the “identity pieces to talk to one another so that we could recognize when something changes from one system to the next,” he said.
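
That “identity pieces talking to one another” idea can be sketched as a simple crosswalk that maps each system's local identifier onto a shared master identifier; the systems and identifiers below are hypothetical:

```python
# Hypothetical MDM-style crosswalk: each source system's local key maps
# to a single master identifier, so a change seen in one system can be
# recognized as a change to the same entity known elsewhere.

crosswalk = {
    ("crm", "C-10482"): "MASTER-001",
    ("billing", "0009921-AC"): "MASTER-001",
    ("support", "ACME-NE"): "MASTER-001",
}

def master_id(system, local_id):
    """Resolve a system-specific identifier to the shared master ID."""
    return crosswalk.get((system, local_id))

# A change event arriving from billing can now be tied back to the same
# entity the CRM knows about.
assert master_id("billing", "0009921-AC") == master_id("crm", "C-10482")
```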

But ultimately that’s a band-aid, he believes, replacing “a bunch of databases working independently with one more working furiously to track a few simple variables and a handful of properties.” That’s worsened by the fact that the properties that are established become locked in: the MDM model can’t be readily changed, which means that over time there’s attrition in the MDM solution. The enterprise loses value as the MDM decays.

There’s also the idea that an enterprise can move to mandating a data model that perfectly captures the business, but the problem there is that the business will evolve, he said. Some data that once was vitally important for business operations no longer carries as much weight, or things that looked like single entities suddenly begin developing exceptions, and the data model risks ossifying unless the organization takes steps, such as instituting change management at a very fundamental level.

He also brought up the fact that over the last decade a lot of time and effort has been spent trying to address the data complexity issues by converting relational databases to relatively flat de-normalized structures.

“That was the way you did it,” he said. “It seemed the best way to present information. We had JSON and XML, and it was all cool until you look at the fact that you begin to lose relational information and you begin to lose information about structure. And all of a sudden we are now at a stage where we are beginning to realize that maybe we have thrown too much information out the window.”

And once information is removed, it’s very expensive to rebuild.
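
A small, invented illustration of what gets thrown out in that flattening: once normalized rows are collapsed into a single document, the explicit foreign-key relationship that made cross-entity queries cheap is no longer represented anywhere.

```python
# Normalized view: the order row points at a customer row through a
# foreign key, so "all orders for this customer" is a direct lookup.
customers = {"C-10482": {"name": "Acme Industrial Ltd."}}
orders = [{"order_id": "O-77", "customer_id": "C-10482", "total": 1250.0}]
acme_orders = [o for o in orders if o["customer_id"] == "C-10482"]

# Flattened, document-style view of the same order, as it might land in
# a data lake: easy to store and read, but the customer is now only a
# copied string rather than a reference.
order_doc = {
    "order_id": "O-77",
    "customer_name": "Acme Industrial Ltd.",  # no key, just text
    "total": 1250.0,
}

# Reconnecting order_doc to the customer record now means matching on a
# name string, which is exactly the expensive rebuilding described above.
```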

Working It Out

So back to those data lakes, which are typically fed by many databases. Prominent among them are non-traditional NoSQL databases that are great at storing document-oriented content or hierarchical data structures, but whose price is costly computation when it comes to building relationships across documents and querying across multiple dissimilar documents. Despite his dislike for the water metaphors, Cagle clarified that he does see the value of data lakes for pulling data together into a single repository to make querying easier, since there is just one access point to deal with. But what still exists, he said, is the problem of there being so many models involved. Ultimately, there’s a good foundation in the approach, but “you need more,” he said. “The drawback is that you consolidated the data, not the model.”

That “more” is a Semantic Hub architecture. “That seems to be emerging as the next stage modification,” he said. It provides a route to consolidating data as well as a way to move beyond largely text- and tag-based search. It’s not about getting rid of the data lake, but about creating a construct around it: the Semantic Data Hub. The approach enables building a canonical model that pulls together information from the various pieces in the data lake for organizational purposes, as well as building the linking structures that connect those pieces together to create canonical data, he said.

“So you can create canonical instances of information that can contain deep content as you need to, but doesn’t have to contain everything. That canonical model in turn can effectively become your primary repository for doing fast searches, and the semantic layer can actually be used to create an even faster layer on top of that. So you basically are able to create three tiers of information: a very, very fast semantic layer; a reasonably fast and consistent canonical model; or a slower but still accessible deep level that basically gives you access into your source data structures in turn,” he said.
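
As a rough sketch of what those linking structures might look like in practice, the snippet below uses the rdflib library and an invented hub vocabulary to assert that two source-system records are facets of one canonical entity, then queries across that link. It illustrates the general idea rather than any particular product.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Invented namespaces for a canonical "hub" vocabulary and source records.
HUB = Namespace("http://example.org/hub/")
SRC = Namespace("http://example.org/source/")

g = Graph()

# One canonical customer, linked to its records in two source systems.
canonical = HUB["customer/MASTER-001"]
g.add((canonical, RDF.type, HUB.Customer))
g.add((canonical, HUB.label, Literal("Acme Industrial Ltd.")))
g.add((canonical, HUB.sourceRecord, SRC["crm/C-10482"]))
g.add((canonical, HUB.sourceRecord, SRC["billing/0009921-AC"]))

# The semantic layer can now answer: which source records describe this
# customer, regardless of which system they came from?
results = g.query(
    """
    PREFIX hub: <http://example.org/hub/>
    SELECT ?record WHERE {
        ?c a hub:Customer ;
           hub:label "Acme Industrial Ltd." ;
           hub:sourceRecord ?record .
    }
    """
)
for row in results:
    print(row.record)
```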

What’s among the big benefits of a Semantic Hub? According to Cagle, it consolidates models as well as data, provides identity management that goes beyond MDM, and becomes increasingly critical for linked data management. He also said it provides the tools to give organizations a robust Reference Data Management system and to more easily incorporate governance and provenance information. With Semantics, where searches are all done against the index, queries on relationships, including extended relationships, can be answered quickly. There are many more benefits to be had, as well.
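
The point about extended relationships can be sketched the same way: because relationships are stored as indexed triples, a single query can follow chains of them, here via a SPARQL property path over an invented reporting hierarchy.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/org/")
g = Graph()

# A small, invented reporting chain stored as triples.
g.add((EX.alice, EX.reportsTo, EX.bob))
g.add((EX.bob, EX.reportsTo, EX.carol))
g.add((EX.carol, EX.reportsTo, EX.dana))

# The "+" property path follows the relationship one or more hops, so
# every manager above alice comes back from a single indexed query.
results = g.query(
    """
    PREFIX ex: <http://example.org/org/>
    SELECT ?manager WHERE { ex:alice ex:reportsTo+ ?manager . }
    """
)
print(sorted(str(row.manager) for row in results))
```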

For instance, he discussed the further value of a Semantic system’s triple store index, which is designed to pre-compute relationships and store them, making it possible to use that information to get references not only into documents, but also into existing data and into derived data. “And,” he said, “you can utilize it to derive data itself that can in turn be stored back into the system,” providing the system with the capability to learn.
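
A minimal sketch, again with invented data, of what deriving data and storing it back might look like: compute new triples from the ones already in the graph and add them to the same store, so later queries can read the derived relationships directly.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/parts/")
g = Graph()
g.add((EX.widgetA, EX.partOf, EX.assembly1))
g.add((EX.assembly1, EX.partOf, EX.machine9))

# Derive indirect partOf relationships and write them back into the
# graph, so they become stored facts rather than repeated computations.
derived = []
for x, _, y in g.triples((None, EX.partOf, None)):
    for _, _, z in g.triples((y, EX.partOf, None)):
        derived.append((x, EX.partOf, z))

for triple in derived:
    g.add(triple)

# The graph now also contains "widgetA partOf machine9" as a stored fact.
print((EX.widgetA, EX.partOf, EX.machine9) in g)
```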

That includes learning in such a way that the system can essentially rebuild its own model dynamically, making it much more resilient to change factors. “That,” he said, “is pretty exciting.”

About the author

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. She has been an executive editor at leading technology publications, including InformationWeek, where she spearheaded an award-winning news section, and Network Computing, where she helped develop online content strategies including review exclusives and analyst reports. Her freelance credentials include being a regular contributor of original content to The Semantic Web Blog; acting as a contributing writer to RFID Journal; and serving as executive editor at the Smart Architect Smart Enterprise Exchange group. Her work also has appeared in publications and on web sites including EdTech (K-12 and Higher Ed), Ingram Micro Channel Advisor, The CMO Site, and Federal Computer Week.
