Smart and Scalable: The Next Generation Data Lakes

Data Lakes have come under some scrutiny in recent times, with complaints that businesses wind up storing anything and everything in them, yet aren’t realizing the value they expected from all the data they’ve piled up.

Sam Chance, who recently took on a Managing Director role in Cambridge Semantics’ Solution Engineering organization, has a take on why businesses may be experiencing these issues. The problem, he says, is that they cannot understand the data. When companies pour disparate data from a myriad of sources into the ‘average’ Data Lake, they have put things physically into the same place while leaving them logically and semantically disconnected.

“There’s no context,” he says. “Until you overlay a graph that specifies the relationships among the data from different contributing sources, there is no real value.”

The proposition behind the company’s graph-based Anzo Smart Data Lake is to intentionally and explicitly use Semantic Web technologies to provide that semantic layer that organizes and interconnects what would otherwise be disparate data in order for businesses to extract the maximum value from it. “That’s exactly the difference and the answer to the criticism of current Data Lakes,” he says. “They are like Minnesota, the land of 10,000 lakes. They are not connected and it’s really more of a data slough.”
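To make the idea of an overlaid graph concrete, here is a minimal sketch in plain Python. It is not Anzo code: the source names, identifiers and the `billedTo` relationship are invented for illustration, and a real deployment would use RDF and a query language rather than list scans. It only shows that a question spanning two sources has no answer until an explicit relationship connects them.

```python
# Toy illustration only: all identifiers and predicates below are hypothetical.
# Each fact is a (subject, predicate, object) triple.
crm_triples = [
    ("crm:cust42", "name", "Acme Corp"),
    ("crm:cust42", "region", "Midwest"),
]
billing_triples = [
    ("bill:inv9", "amount", 1200),
    ("bill:inv9", "accountId", "A-77"),
]
# The semantic overlay: an explicit relationship between the two sources.
link_triples = [
    ("bill:inv9", "billedTo", "crm:cust42"),
]


def objects(graph, subject, predicate):
    """Return every object recorded for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def invoices_for_region(graph, region):
    """Find (invoice, amount) pairs whose customer sits in the given region."""
    results = []
    for s, p, o in graph:
        if p == "billedTo" and region in objects(graph, o, "region"):
            results.append((s, objects(graph, s, "amount")[0]))
    return results


lake_without_links = crm_triples + billing_triples
lake_with_links = lake_without_links + link_triples
```

Calling `invoices_for_region` on `lake_without_links` finds nothing, because the data sits in one place but is not connected. The same call on `lake_with_links` can walk from the invoice to the customer's region.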

Last fall Cambridge announced an enhanced version of the Anzo Smart Data Lake, distinguished by the Anzo Graph Query Engine’s new ability to load Semantic Data into memory across multiple commodity virtual or physical machines, on-premises or in AWS or other Cloud platforms. Users can therefore query data at very high volume and performance levels. The traditional approach was to employ a single server node, where memory and resources were constrained to whatever that machine afforded.

“Now that restriction is removed by letting multiple nodes to be added to load increasingly more data into memory,” Chance says. “That’s probably the biggest game changer – that horizontally scalable graph query engine.”
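The general pattern behind horizontal scale-out can be sketched as follows. This is an assumption-laden toy, not a description of how the Anzo Graph Query Engine actually partitions or queries data: it hashes each triple's subject to pick a node, then answers a query by asking every node and merging the results. The function names and sample data are hypothetical.

```python
import zlib

# Hypothetical sample data; each fact is a (subject, predicate, object) triple.
triples = [
    ("cust:1", "region", "Midwest"),
    ("cust:2", "region", "West"),
    ("inv:9", "billedTo", "cust:1"),
    ("inv:10", "billedTo", "cust:2"),
    ("inv:9", "amount", "1200"),
]


def node_for(subject, num_nodes):
    """Pick a node with a stable hash so the same subject always lands together."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def partition(all_triples, num_nodes):
    """Spread triples across in-memory shards, one list per node."""
    nodes = [[] for _ in range(num_nodes)]
    for triple in all_triples:
        nodes[node_for(triple[0], num_nodes)].append(triple)
    return nodes


def scatter_gather(nodes, predicate):
    """Send the same filter to every shard, then merge the partial results."""
    matches = []
    for shard in nodes:  # a real engine would run these in parallel
        matches.extend(t for t in shard if t[1] == predicate)
    return sorted(matches)


cluster = partition(triples, num_nodes=3)
billed = scatter_gather(cluster, "billedTo")
```

Adding capacity in this sketch means raising `num_nodes` and repartitioning. The query code does not change, which is the property the single-server approach lacks.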

He also points to massively parallel processing technologies that are commonly described as Big Data or Big Cloud, like Apache Spark, as now being incorporated into the Anzo platform for parallelizing the ingestion of data, “which lets us play in the Big Data space but in a graph-based model.”

Where enterprises or projects have complex and diversified data, with many different concepts, the graph-based approach lets users query across a variety of entity types, which in practice amounts to being able to query across all data sources, structured and unstructured, he says.

“So increasingly users are able to discover new information because they have the ability to ask arbitrary questions from the entire corpus of available data,” Chance says. “Users are empowered with the ability to ask questions across all their holdings.”

Full Integration, More User Power

Cambridge’s delivery of an end-to-end approach, from ingestion of data into the Anzo Smart Data Lake to distributed analytics to governance, is part of its work to build the case for and foundation of a Smart Data Lake strategy.

It maps out this way: The Anzo platform starts with Apache Spark-based processing for ingestion, then leverages RDF and OWL Semantic technologies to create expressive representations of incoming data and to store them in a standards-based model. The data is made available on existing file systems like HDFS or on new ones companies may want to use, including proprietary systems. The Semantic Data, loaded into memory in horizontal scale-out models, is partitioned so that users can query across the entire graph, or multiple graphs, at Big Data scale. Part of the design for providing a full “chain of custody” is making data available in catalogues where people can check it in and out without compromising data governance, lineage and provenance.
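As a rough illustration of what representing incoming data in RDF can look like, the sketch below turns one tabular record into N-Triples lines, a standard line-based RDF serialization. It is not Anzo's ingestion code. The `http://example.org/` namespace and the function names are assumptions, and the sketch assumes the entity type, key and column names are already safe to embed in an IRI.

```python
BASE = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(entity_type, key, row):
    """Turn one record (a dict of column -> value) into N-Triples lines.

    Assumes entity_type, key and column names contain only IRI-safe characters.
    """
    subject = f"<{BASE}{entity_type}/{key}>"
    # First line states what kind of thing the record describes.
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}{entity_type}> ."]
    # Each column becomes a predicate with a plain string literal as its object.
    for column, value in row.items():
        lines.append(f'{subject} <{BASE}{column}> "{escape_literal(value)}" .')
    return lines


customer_lines = row_to_ntriples("Customer", "42", {"name": "Acme Corp", "region": "Midwest"})
```

Because every record ends up in the same subject-predicate-object shape, rows from unrelated systems can share one store and be linked later through an ontology, which is the point of a standards-based model.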

The ability to bring scalability to Semantics enterprise-wide makes it possible for this end-to-end approach to exist, says John Rueter, Cambridge VP of Marketing.

Data Scientists, of course, can use the system to configure views into data or to even use it as a way to curate data before loading it into traditional BI tools like Tableau or QlikView for business users to explore. But end users also can tap into its browser-based interface themselves to construct dashboards that provide views into data by simply navigating across the model, which is represented as one or more OWL ontologies. No semantic knowledge is required on their part.

“It’s so approachable and intuitive that business analysts are able to create dashboards that have views into the data all through what I consider an easy and intuitive interface, and with a minimal amount of training,” Chance says.

Users can create drill-downs that take them from one idea or concept to further detail about it, including unstructured content. For example, if some of the data behind a chart a user builds comes from an unstructured text document, the user can drill right back to the original document where that specific information originated, he explains.
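One way to picture this kind of drill-down is to keep a pointer to the source alongside every extracted fact. The sketch below is a hypothetical data model, not Anzo's: the field names, file names and character offsets are invented. It aggregates facts for a chart and then recovers the documents behind one bar.

```python
from collections import defaultdict

# Hypothetical extracted facts; each one remembers where it came from.
facts = [
    {"entity": "Acme Corp", "metric": "complaints", "value": 2,
     "source": "support_email_01.txt", "span": (120, 164)},
    {"entity": "Acme Corp", "metric": "complaints", "value": 1,
     "source": "call_notes_07.txt", "span": (15, 58)},
    {"entity": "Globex", "metric": "complaints", "value": 4,
     "source": "support_email_02.txt", "span": (0, 41)},
]


def chart_totals(all_facts, metric):
    """Aggregate one metric per entity, as a bar chart would display it."""
    totals = defaultdict(int)
    for fact in all_facts:
        if fact["metric"] == metric:
            totals[fact["entity"]] += fact["value"]
    return dict(totals)


def drill_down(all_facts, entity, metric):
    """Return the (document, character span) pairs behind one chart value."""
    return [
        (fact["source"], fact["span"])
        for fact in all_facts
        if fact["entity"] == entity and fact["metric"] == metric
    ]


totals = chart_totals(facts, "complaints")
evidence = drill_down(facts, "Acme Corp", "complaints")
```

The design choice that matters is that provenance travels with each fact rather than being reconstructed afterwards, so any aggregate can be traced back to its originating passages.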

Full Speed Ahead

Giving business users that power helps take some pressure off of Data Scientists’ workloads but also, “Because of the graph-based nature of storing data, it’s reusable across the organization,” Rueter says. There’s no need for end users to go to programmers time and again to have them perform manual coding to get access to data as more and more questions occur to them. Insight is accelerated, in more ways than one.

“From the time data comes in to the lake to the time you can ask questions and get answers it’s just exponentially faster,” says Rueter. In December the company announced that its Anzo Graph Query Engine completed a load and query of one trillion triples as a Google Cloud Partner on the Google Cloud Platform in just under two hours. That was 100 times faster than the previous solution running the Lehigh University Benchmark (LUBM) – an industry standard that evaluates the query performance of Semantic Web repositories over a large data set – at the same data scale. “So besides flexibility and usability it’s the speed that is loading and querying of data, and eliminating any need for coding of data,” he says.

The idea behind building solutions via configuration rather than coding, Chance summarizes, is that the end user is empowered and the gap between IT and the business shrinks further. Where a gap remains, Cambridge offers standard APIs and query endpoints that developers can use to build the applications they require. The APIs and workflows are cast in contemporary models to minimize the learning curve of interacting with semantic technology. Chance adds that there’s no need for businesses to replace existing Data Lakes or other data deployments: “We can bring the Anzo Smart Data Lake in as a role player which provides the semantic layer and interfaces that with their analytics layer,” he says.

Photo Credit: PowerUp/Shutterstock
