Dandelion, the service from SpazioDati whose goal is to deliver linked and enriched data for apps, recently introduced a new suite of products for semantic text analysis.
Its dataTXT family of semantic text analysis APIs includes dataTXT-NEX, a named entity recognition API that links entities in the input sentence with Wikipedia and DBpedia and, in turn, with the Linked Open Data cloud, and dataTXT-SIM, an experimental semantic similarity API that computes the semantic distance between two short sentences. A third member, dataTXT-CL (now in beta), is a categorization service that classifies short sentences into user-defined categories, says SpazioDati CEO Michele Barbera.
“The advantage of the dataTXT family compared to existing text analysis tools is that dataTXT relies neither on machine learning nor NLP techniques,” says Barbera. “Rather it relies entirely on the topology of our underlying knowledge graph to analyze the text.” Dandelion’s knowledge graph merges several Open Community Data sources (such as DBpedia) with private data collected and curated by SpazioDati. The graph is still in private beta and not yet publicly accessible, though the plan is to gradually open up portions of it through the service’s upcoming Datagem APIs, “so that developers will be able to access the same underlying structured data by linking their own content with dataTXT APIs or by directly querying the graph with the Datagem APIs; both of them will return the same resource identifiers,” Barbera says. (The Semantic Web Blog’s initial coverage of Dandelion includes additional discussion of its knowledge graph.)
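To make the idea of analyzing text through graph topology more concrete, here is a minimal toy sketch in Python. It is not Dandelion's algorithm, which the company has not published here; the graph, the entity names, the candidate table and the scoring rule are all invented for illustration. It simply picks, for an ambiguous mention, the candidate entity whose neighborhood in the graph overlaps most with the other entities found in the same sentence, with no training step involved.

```python
# Toy illustration only: the graph, names and scoring rule below are
# assumptions made up for this sketch, not Dandelion's data or method.

# A tiny undirected knowledge graph as adjacency sets.
GRAPH = {
    "Paris_(city)": {"France", "Eiffel_Tower", "Seine"},
    "Paris_Hilton": {"Hilton_Hotels", "Reality_television"},
    "France": {"Paris_(city)", "Eiffel_Tower", "Seine"},
    "Eiffel_Tower": {"Paris_(city)", "France"},
    "Seine": {"Paris_(city)", "France"},
    "Hilton_Hotels": {"Paris_Hilton"},
    "Reality_television": {"Paris_Hilton"},
}

# Surface forms mapped to the graph nodes they might refer to.
CANDIDATES = {
    "paris": ["Paris_(city)", "Paris_Hilton"],
}


def neighborhood(node):
    """Return the node together with its direct neighbors."""
    return GRAPH.get(node, set()) | {node}


def score(candidate, context_entities):
    """Count graph nodes shared between the candidate and each context entity."""
    return sum(
        len(neighborhood(candidate) & neighborhood(ctx))
        for ctx in context_entities
    )


def disambiguate(mention, context_entities):
    """Pick the candidate whose neighborhood overlaps most with the context."""
    candidates = CANDIDATES.get(mention.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, context_entities))


if __name__ == "__main__":
    print(disambiguate("Paris", ["Eiffel_Tower"]))
    print(disambiguate("Paris", ["Hilton_Hotels"]))
```

Because the decision depends only on which nodes are connected, a sketch like this needs no grammatical parse of the sentence, which is consistent with the resilience to short or malformed text that Barbera describes, though a production system would certainly use a far richer graph and scoring scheme.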
Barbera says the advantages dataTXT gets by leveraging its knowledge graph include resiliency to grammatically malformed texts and the ability to work well on short sentences. The approach also makes dataTXT easily applicable to different languages (English and Italian are first in line) and easily extendable to user-defined terms and categories, for example, without requiring long and expensive training. “dataTXT-NEX makes it very easy for developers to link their content with DBpedia, which is at the core of the Linked Data Cloud,” says Barbera. “Once content is linked to the LOD cloud, developers can tap into an enormous and growing network of linked information they can leverage to make their apps smarter, richer and easier for the final users’ data.”
The idea at large, he says, is to lower the barrier for accessing advanced semantic technologies, so that small companies and independent developers can express their creativity. “This will boost innovation for the benefit of all. Our vision in the mid-term is to build a knowledge-graph as-a-service that everybody, not just the big corporations, can easily access and leverage to enhance their business,” he says.
The company is currently working with several partners, from startups to big corporations, in fields including mobile tourism, journalism, banking and finance, and security, he says. “They’re using our services to enrich existing databases, build smart search engines and recommender systems on document collections, add location knowledge to web apps, automatically tag product descriptions on e-commerce sites, create infographics and more,” says Barbera.
While the company initially planned to release its Datagem APIs first, letting users query Dandelion’s knowledge graph in a structured way (via a simpler query language than SPARQL) before releasing the APIs of the dataTXT family, it discovered during its private beta “a very strong demand for enriching unstructured content (especially short content)… DataTXT APIs are an easier way to approach the complex world of the Web of Data. So we decided to reverse our approach,” Barbera says.
The company learned many other things as well from its private beta, which drew more than 500 users from all over the world. “We definitely want to allow users to contribute to the graph and curate their own dataset. However, we are not yet there in terms of usability of the technology. Freebase is a great example in terms of usability but it’s still far too difficult to use for most users,” he says. “OpenRefine is another great step in lowering the barrier to curating and enriching tabular datasets. Unfortunately it’s not designed as a multi-user application and it’s not yet scalable to very large datasets. In the next months we’ll work with the amazing OpenRefine community to extend its functionalities and overcome some of its limitations. This will set the foundations for opening up Dandelion’s internal data curation pipeline to the users in the future.”
The new year also should see the company working with partner data providers to grow its knowledge graph.