TextRazor has added comprehensive semantic analysis support for 20 languages (up from 10) to its customizable, open semantic analysis and text mining API, a move the startup says answers increasing demand for sophisticated semantic tools that go beyond English.
The company’s technology has been in public beta for just a few months. It differs from other multilingual natural language processing solutions, says founder Toby Crayston, in that it strongly leverages linked data sources like DBpedia and the semantic web to disambiguate, normalize and filter extracted metadata with better accuracy, so that end users can build powerful multilingual classifiers regardless of the language of their documents.
Crayston founded the company after spending time in R&D at Bloomberg, where he did a lot of work on entity extraction and text analytics for financial news. During that tenure, he says, it became clear that there was a big gulf between the latest semantic technology coming out of academia and what large companies need from the technology to make money with it. His goal was to bridge that gap, and doing so, he says, meant delivering not just bells and whistles but also the things enterprise IT cares about: ease of integration, the speed of the underlying engine, and how easily it can be customized.
To that end, the API can be easily integrated with any language that can send an HTTP request and parse the JSON response. It is written in heavily optimized C++, capable of processing thousands of words per second per core, with a distributed backend built to scale automatically to billions of requests when run in the cloud (it can also be deployed behind a firewall), the company notes.
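As a rough illustration of that integration pattern, the sketch below builds a form-encoded analysis request and pulls entities out of a canned JSON response. The endpoint URL, header name, parameter names and response shape here are assumptions for illustration only; the real request format is defined by TextRazor's own documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; check TextRazor's docs for the real one.
API_URL = "https://api.textrazor.com/"

def build_request(api_key, text, extractors=("entities", "topics")):
    """Build the headers and form-encoded body for an analysis request."""
    headers = {
        "x-textrazor-key": api_key,  # assumed auth header name
        "content-type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"text": text, "extractors": ",".join(extractors)})
    return headers, body

def extract_entities(response_text):
    """Parse a JSON response and return (entity, types) pairs."""
    data = json.loads(response_text)
    return [(e["entityId"], e.get("type", []))
            for e in data.get("response", {}).get("entities", [])]

# A canned response in the assumed shape, standing in for a live call:
sample = json.dumps({"response": {"entities": [
    {"entityId": "Bloomberg L.P.", "type": ["Company"]}]}})
```

Any stack that can POST a form body and decode JSON — a shell script with curl, a Java servlet, a Node service — follows the same two steps.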
As for customization, the focus is on accounting for the product names, companies and other things that matter to a particular enterprise by adding named entities, topic tags, synonyms, grammatical syntax and so on to its Prolog-based rules engine. Existing category ontologies or specific sets of topics can be brought into the mix. “What’s often overlooked, for example, is that most organizations are interested in classifying documents into their own custom taxonomies, usually with domain-specific topics that won’t be picked up by off-the-shelf tools,” Crayston says. Users can post rules with each request so that results can be combined with a custom taxonomy.
“So, you can provide very powerful queries as well as upload custom lists of entities,” he says. “If you have a very ambiguous brand name, for instance, you can say you are only interested in that brand name if it’s mentioned in context of cycling, if that’s what your company does, or certain other keywords to find the right matches.”
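A toy version of that kind of context gate might look like the following. The keyword set and matching logic are invented for illustration and are not TextRazor's actual Prolog rule syntax; the point is simply that an ambiguous brand mention is kept only when disambiguating context appears alongside it.

```python
# Invented context keywords for a hypothetical cycling brand.
CONTEXT_KEYWORDS = {"cycling", "bike", "peloton", "race"}

def keep_brand_mention(brand, sentence, context=CONTEXT_KEYWORDS):
    """Keep a brand match only if disambiguating context words co-occur."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return brand.lower() in words and bool(words & context)
```

A rules engine generalizes this beyond bag-of-words co-occurrence — to syntactic relations, entity types and uploaded entity lists — but the filtering principle is the same.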
The added language support is important, he notes, because it requires companies to think about multilingual needs upfront rather than as an after-the-fact add-on. TextRazor put a lot of time into coming up with a better way of detecting and disambiguating entities based on localized content and more language-agnostic data sources, he says.
“We’ve had a lot of demand for the last few months,” he says. The problem with many languages is that there is not much linguistic data available to help build models for classifying and disambiguating entities, he says. “We can use the language links between different concepts in Wikipedia and structured data from DBpedia and Freebase to build up a big language-independent graph of concepts. In a low-resource language we can leverage this graph to provide more contextual information for improved disambiguation accuracy.”
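The idea can be sketched with a toy concept graph. The labels, edges and scoring below are invented for illustration, but they show the mechanism: cross-language links collapse surface forms in different languages onto shared concept nodes, so language-independent edges (of the kind DBpedia provides) can disambiguate a mention even when the mention's own language has little linguistic data.

```python
# Surface forms in any language map to the same candidate concept IDs.
LABELS = {
    "jaguar": ["Jaguar_(animal)", "Jaguar_Cars"],
}

# Language-independent edges between concepts (invented for illustration).
GRAPH = {
    "Jaguar_(animal)": {"Felidae", "Rainforest"},
    "Jaguar_Cars": {"Automobile", "Coventry"},
}

def disambiguate(mention, context_concepts):
    """Pick the candidate whose graph neighbourhood best matches the context."""
    candidates = LABELS.get(mention.lower(), [])
    return max(candidates,
               key=lambda c: len(GRAPH.get(c, set()) & set(context_concepts)),
               default=None)
```

Because the graph is keyed on concepts rather than words, context concepts extracted from, say, a Spanish document score candidates exactly as English ones would.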
Coming out soon are custom lists of entities to support requirements for verticals like the medical domain.