Hadoop is on almost every enterprise’s radar – even if they’re not yet actively engaged with the platform and its advantages for Big Data efforts. Analyst firm IDC said earlier this year that the market for software related to the Hadoop and MapReduce programming frameworks for large-scale data analysis will grow at a compound annual rate of more than 60 percent between 2011 and 2016, rising from $77 million to more than $812 million.
Yet challenges remain in leveraging all the possibilities of Hadoop, an Apache Software Foundation open source project, especially as it relates to empowering the data scientist. Hadoop is composed of two sub-projects: HDFS, a distributed file system built on a cluster of commodity hardware so that data stored on any node can be shared across all the servers, and the MapReduce framework for processing the data stored in those files.
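The MapReduce model described here can be sketched in plain Python. This is a toy, single-process illustration of the map, shuffle, and reduce phases, not Hadoop's actual Java API; in a real cluster each phase would be distributed across nodes by the framework:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit an intermediate (word, 1) pair for each word in each line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Toy input standing in for lines of files stored in HDFS.
lines = ["big data on hadoop", "data stored in hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # e.g. counts["data"] == 2
```

The appeal of the model is that the map and reduce functions contain all the job-specific logic, while partitioning, parallel execution, and fault tolerance are handled by the framework.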
Semantic technology can help solve many of the challenges, Michael A. Lang Jr., VP, Director of Ontology Engineering Services at Revelytix, Inc., told an audience gathered at the Semantic Technology & Business Conference in New York City yesterday.
Hadoop “provides the ability to do analysis in a really large-scale way, with all enterprise data, regardless of size, format, and structure,” he said. “That’s a huge area where you can get competitive advantage. You can start to figure things out about your business, competitors, and partners.”
What might be holding companies back from fully exploiting Hadoop to such ends is that the platform offers little to no data management capability. “It really mainly is a place where you can store files. As far as understanding what data is in what files, how it is related to other data sets…there is not a lot of capability within the core Hadoop,” he said.
It’s not only difficult to understand the metadata around data sets; even understanding a data set’s schema is difficult, he said. “That’s something we take for granted in the relational database world because it’s always there.”
Additionally, MapReduce doesn’t suit all the kinds of data processing a company might want to do, such as transformations and analysis. “If you have to repeat something you want it in a place where people can find, reuse, and share it with each other. If you apply a transformation to a data set you want to know what happened,” he said. Data integration and provenance are also concerns.
Revelytix is aiming to address such issues with technology it’s building called Loom. Its registry enables quick discovery and use of data sets, tracking them and every step executed in transforming the data, so that users know how data sets are related to each other and what their provenance is. The Loom Transformation Engine leverages much of Revelytix’s core technology, which also works outside Hadoop and is heavily based on semantics, he noted. R2RML, a language for expressing customized mappings from relational databases to RDF datasets, plays a big role in the technology, for instance.
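To make the role of R2RML concrete: an R2RML mapping is itself an RDF document (typically written in Turtle) that declares how rows of a relational table become RDF resources. The table and column names below are hypothetical, not taken from Revelytix’s products, but the vocabulary (`rr:logicalTable`, `rr:subjectMap`, `rr:predicateObjectMap`) is the standard W3C R2RML one:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

# Hypothetical mapping: each row of a CUSTOMERS table
# becomes an ex:Customer resource in the RDF view.
<#CustomerMap>
    rr:logicalTable [ rr:tableName "CUSTOMERS" ] ;
    rr:subjectMap [
        rr:template "http://example.com/customer/{ID}" ;
        rr:class ex:Customer
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "NAME" ]
    ] .
```

Once such a mapping exists, tooling can treat the relational data as if it were RDF conforming to the target ontology, which is what makes the ontology-level querying described next possible.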
By examining the data, writing a schema for each data set, and then mapping each schema to an ontology, it becomes possible to write queries against the data in terms of the ontology. “Once the mapping is in place, data scientists don’t need to know the details of the underlying schema,” he said. Data scientists can write the query related to the business problem they want to solve in the language they understand, from SQL to SPARQL. “And Loom translates that to MapReduce and MapReduce does the work of distributing that workload, parallelizing execution, and getting you everything you get out of working with Hadoop.”
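A query written at the ontology level might look like the SPARQL sketch below. The `ex:` terms are the same hypothetical ontology names used for illustration, not vocabulary from the talk; the point is that the query mentions only business concepts, with no reference to underlying table or column names:

```sparql
PREFIX ex: <http://example.com/ns#>

# Ask for customer names purely in ontology terms;
# the engine resolves this to the underlying storage.
SELECT ?name
WHERE {
    ?customer a ex:Customer ;
              ex:name ?name .
}
```

In the architecture described, an engine such as Loom would compile a query like this, via the R2RML mappings, into MapReduce jobs over the files in HDFS.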
Where semantics fits in the world of Hadoop, he explained, “is helping you prepare data, work with data and integrate it for some very specific analytic use case…or sometimes you just want a table of results and that is your analysis, and you can just pull that out of Hadoop.”