Enterprises using semantic technology often run into the same problem: they cannot scale their approaches across domains.
“We hear a lot about semantic approaches that work great when targeted to a domain – for example, you can train up an NLP engine for the hotel industry domain that knows ‘thin’ is a bad word when applied to it,” explained YY Lee, chief operating officer at customer intelligence vendor FirstRain at the recent SemTech conference in San Francisco. “But the amount of the business world to be potentially covered by semantic techniques – that limitation to train for specific domains cannot scale.”
An example: In a sales organization, sales teams need to understand what matters to their customers in a variety of industries, and often to their customers’ customers, too, she said. FirstRain’s mission is to make sense of all the unstructured information on the Internet that can feed into efforts like that. It’s estimated that FirstRain processes hundreds of thousands of websites and millions of documents daily, extracting and bringing to the user’s attention only business-relevant content from news, blogs, PR, corporate web sites, government filings, and more.
To get where it is today, Lee said, the company had to come to the understanding that you cannot encode the structure by which you map the information. Instead, it had to create a flexible data model to reflect differing topologies across different information domains.
“Ten years ago we tried a taxonomy but they don’t really work because they are static,” Lee noted. It’s clear to her team, she said, that fixed taxonomies don’t represent the real world: they are slow to react to business changes, and they struggle to accurately represent companies that operate in many diverse markets.
“So we created a flexible data structure that could reflect the different atomic players and pieces in the business, and based on the information we see coming over we could [semantically] categorize and derive the structure of different business and relationships between entities. So, over time our internal data structures are driven by the information we process.”
Its semantic categorization technology uses patented algorithms and models to automatically structure all FirstRain intelligence and organize it into the thousands of industries, topics and business lines of what it terms its dynamically generated Business Web Graph, the company says.
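To make the contrast with a fixed taxonomy concrete, here is a minimal Python sketch of a data structure whose nodes and relationships come only from the documents it has seen. It is an illustration under stated assumptions, not FirstRain's patented method or its Business Web Graph: the `EntityGraph` class, its methods, and the sample entity names are all hypothetical, and it assumes entities have already been extracted from each document.

```python
from collections import defaultdict
from itertools import combinations


class EntityGraph:
    """Toy graph with no predefined categories (hypothetical, for illustration).

    Nodes and edges appear only when a processed document mentions them.
    """

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(int)  # (entity_a, entity_b) -> co-mention count

    def observe(self, entities):
        """Record one document's (already extracted) entities."""
        unique = sorted(set(entities))
        self.nodes.update(unique)
        for a, b in combinations(unique, 2):
            self.edges[(a, b)] += 1

    def related(self, entity, min_count=1):
        """Return (neighbor, count) pairs, strongest relationship first."""
        found = {}
        for (a, b), count in self.edges.items():
            if count < min_count:
                continue
            if a == entity:
                found[b] = count
            elif b == entity:
                found[a] = count
        return sorted(found.items(), key=lambda kv: (-kv[1], kv[0]))


graph = EntityGraph()
graph.observe(["Acme", "solar panels", "Globex"])
graph.observe(["Acme", "solar panels"])
print(graph.related("Acme"))
```

Nothing in the sketch names an industry or a topic up front; a new market shows up in the graph as soon as documents start mentioning it, which is the property a static taxonomy lacks.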
Twitter is the most recent source from which FirstRain can extract business-relevant content; earlier this spring the company announced FirstTweets. “With tweets and social content the information ambiguity could just kill you,” Lee noted. That is where its affinity scoring technology comes in. Layered on top of the categorization capabilities and specially adapted for the task, it aims to provide a better understanding of what a social media comment might really be about, using statistical analysis and matching the comment’s semantic characteristics against other families of things with which it seems to be associated.
“Affinity scoring must be a breakthrough for classes of information where there is a lot of ambiguity,” she said. “And the cool thing about it is that you can actually apply it in a way to create a virtuous self-improving spiral that works across massively different information domains. When you set up the correct feedback loop of affinity scoring and don’t encode to different domains, but let it swing across those you are trying to match things to, you can create a self-learning system.”
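The article gives no detail on how affinity scoring is computed, so the following Python sketch is only one plausible reading of the idea, not FirstRain's algorithm. It assumes cosine similarity over word counts as the statistical match, and a simple feedback loop in which a confidently matched text is folded back into the profile it matched. The `AffinityScorer` class, the `threshold` parameter, and the sample profiles are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AffinityScorer:
    """Toy scorer (hypothetical): match text to term profiles, learn from matches."""

    def __init__(self, profiles, threshold=0.3):
        self.profiles = {name: Counter(terms) for name, terms in profiles.items()}
        self.threshold = threshold  # assumed cutoff for a confident match

    def score(self, text):
        """Affinity of the text to every known profile."""
        tokens = Counter(text.lower().split())
        return {name: cosine(tokens, p) for name, p in self.profiles.items()}

    def classify(self, text):
        """Return the best-matching profile name, or None if too ambiguous."""
        scores = self.score(text)
        best = max(scores, key=scores.get)
        if scores[best] < self.threshold:
            return None
        # Feedback loop: a confident match enriches the profile it matched.
        self.profiles[best].update(text.lower().split())
        return best


scorer = AffinityScorer({
    "hotels": ["hotel", "room", "walls", "thin"],
    "phones": ["phone", "screen", "battery", "thin"],
})
print(scorer.classify("the hotel room had thin walls"))
```

The word "thin" appears in both profiles, so on its own it is ambiguous; the surrounding words decide the match. Each confident match then adds its words to the winning profile, which is the sketch's stand-in for the self-improving spiral Lee describes.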
That’s an example of where things stand today when it comes to semantics that adapt to the occasion. Said Lee, “We are at the edge of being able to handle a lot of variety of information using very efficient processes.”