Later this year, expect to see an open source version of Treo, a semantic search and question answering system designed to help organizations deal with the variety problem of Big Data in particular. A Digital Enterprise Research Institute (DERI) thesis project headed up by fifth-year PhD candidate and Amtera Semantic Technologies co-founder André Freitas, Treo (which means ‘direction’ in Gaelic) aims to take on highly heterogeneous databases with thousands or even millions of attributes, offering an intuitive natural language interface for “talking” to that data.
“Treo,” says Freitas, “is kind of an elegant algorithm to use distributional semantics for answering queries over graph data.”
The first component that should launch as open source, Easy-ESA, is part of Treo’s distributional semantics component (as of the end of September, it had yet to make its way through DERI’s IP process, which takes a few weeks). Because semantic web dataset models are schema-less and can support more heterogeneity, there’s the challenge of semantically matching users’ natural language queries to dataset triples, he says. Distributional semantic relatedness, Freitas notes, provides an automated solution here. The distributional semantic model offers a way to match query terms to dataset terms, using semantic information embedded in large textual resources available on the Web, such as Wikipedia.
“You take a large collection of text, like historical data from newspapers, and using distributional semantics you can build very comprehensive, but not that precise, semantic models that, if you know how to apply them to your problem, or how to inject distributional semantics into your algorithm, you can make use of a huge knowledge base automatically extracted from that [data],” he says. “It’s a scalable way to address the vocabulary problem,” one that he says addresses the limitations of WordNet-based solutions and returns to users a concise list of semantically related results for their queries.
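To make the matching idea concrete, here is a minimal sketch of distributional relatedness between a query term and dataset terms. It is not Treo or Easy-ESA code. It assumes a simple concept-vector approach in the spirit of Explicit Semantic Analysis, and the tiny `CONCEPTS` corpus, the function names and the sample terms are all invented for illustration, standing in for a resource like Wikipedia.

```python
import math
from collections import Counter

# Toy "concept" texts standing in for Wikipedia articles (invented for this sketch).
CONCEPTS = {
    "Marriage": "spouse wife husband married wedding marriage couple",
    "Family": "child daughter son parent wife husband family",
    "Geography": "city country capital river population located",
    "Employment": "employer job company worker salary occupation",
}


def build_term_vectors(concepts):
    """Represent each term as a tf-idf weighted vector over concepts."""
    n_concepts = len(concepts)
    term_counts = {name: Counter(text.split()) for name, text in concepts.items()}
    # Document frequency: in how many concepts does each term occur?
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    vectors = {}
    for name, counts in term_counts.items():
        for term, freq in counts.items():
            idf = math.log(n_concepts / doc_freq[term]) + 1.0
            vectors.setdefault(term, {})[name] = freq * idf
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_dataset_terms(query_term, dataset_terms, vectors):
    """Order dataset terms by relatedness to the query term, best first."""
    query_vector = vectors.get(query_term, {})
    scored = [(t, cosine(query_vector, vectors.get(t, {}))) for t in dataset_terms]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


vectors = build_term_vectors(CONCEPTS)
# A user asks about someone's "wife"; the dataset only has these attribute names.
ranked = rank_dataset_terms("wife", ["capital", "spouse", "occupation"], vectors)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

In this toy run "spouse" ranks first because it shares a concept with "wife", even though the two strings have nothing in common. A real system would build the vectors from a far larger corpus and would need to handle multi-word terms and noise, which this sketch ignores.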
With Treo, organizations can take any domain-specific dataset whose format is so heterogeneous that it is hard to query, or whose schema is hard to work out, and build a natural language interface on top of it for semantic-like search, he says. Freitas says the team working on Treo has tested it on several datasets, including internal ones such as data collected by sensors installed at DERI for energy intelligence in support of sustainability efforts. “We’re using Treo to query that, and more recently we are trying to extend Treo to cope with unstructured text. So we are indexing the unstructured text of Wikipedia together with DBpedia to see what kind of queries Treo can answer.” Check this video to see some of that in action, as Treo acts as a doorway to a Do-It-Yourself Jeopardy Q&A system, a la IBM Watson.
So long as an enterprise has its data in a graph-like format, Treo can work with it, even if it’s not specifically Linked Data or RDF triples. “Businesses, academics or startups can benefit from trying to think on the scenario of what happens when you don’t have constraints to representing your data, what happens when you put flexibility into this process? What kind of product or application can you develop on top of that?” says Freitas. “Seeing distributional semantics working in real-world scenarios is very exciting, and even more exciting is the real-life potential application of this to address the challenges of building intelligent apps.”