There’s more than one way to get a taxonomy. A company can go out and buy one for its industry, for instance, but the risk is that the terms may not relate to how it talks about content in its own organization, and the hierarchy may not be the right fit either. That sets up two potential outcomes, says Chris Riley, VP of marketing at Pingar: Either you wind up having to customize it, or your users just ignore it.
It’s possible to build one, but that’s a big and costly job, especially for the many enterprises that haven’t traditionally focused on structuring content and so don’t necessarily have the skills to do it. While industries like publishing, oil and gas, life sciences, and pharma have that bent, many other verticals do not. In fact, Riley notes, they may realize they have a content organization problem without realizing that the remedy goes by the name ‘taxonomy.’
Pingar’s looking to help out those enterprises that want to bring organization to their content, whether or not they’re familiar with the concept of a taxonomy. It just launched its automated Taxonomy Generator Service, which uses an organization’s own content to build a taxonomy that mirrors that organization’s way of talking about things and its understanding of relationships between child and parent terms.
There are a few reasons why organizations that haven’t invested much time in structuring content before are now starting to care enough that an automated taxonomy generator would catch their eye, Riley says.
“They’re reactive to something big like lost IP or issues with their search not working, or issues with being out of compliance for regulations and being ready for e-discovery,” says Riley. When it comes to the latter, for example, “you can’t say to a judge that you did a search and you think you used the right keywords to find the content you’re looking for. Being able to drill into content from a hierarchical or refined point of view is very important, so that you can say that every document we know that is tagged with x, y and z we identified and we did discovery on that.”
Pingar’s technology includes the Pingar API server, a SOAP-based service that lets developers leverage entity extraction and map taxonomy relationships for documents, emails and so on. Customers in sectors like content management that already have a specific dialect they use when referencing content can feed in their existing taxonomy. Instead of returning a full slate of keywords and taxonomy terms, the service reconciles what it extracts with the existing taxonomy and its preferred terms, to ensure that all applicable information is searched, saved and stored systematically.
For instance, an existing taxonomy may use the word feline but what’s read and extracted from a document is the word cat, so what’s mapped and returned is feline. That same entity extraction engine is rolled into its new automated taxonomy engine to crawl an organization’s file shares and email boxes and build a taxonomy based on what’s there. “We also built a user interface that lets you modify and maintain these taxonomies,” Riley says.
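The cat-to-feline reconciliation can be illustrated with a minimal sketch. This is not Pingar's API: the `PREFERRED_TERMS` table and the `reconcile` function are hypothetical names invented for illustration, and passing unmatched terms through unchanged is an assumption rather than documented behavior.

```python
# Hypothetical lookup table: extracted surface form -> preferred taxonomy term.
# A real system would load this from the customer's existing taxonomy.
PREFERRED_TERMS = {
    "cat": "feline",
    "kitten": "feline",
    "dog": "canine",
}


def reconcile(extracted_terms, preferred=PREFERRED_TERMS):
    """Map extracted terms to preferred taxonomy terms.

    Terms are lowercased before lookup, duplicates are dropped, and the
    order of first appearance is preserved. Terms with no entry in the
    table are returned unchanged (an assumption made for this sketch).
    """
    seen = set()
    result = []
    for term in extracted_terms:
        key = term.strip().lower()
        mapped = preferred.get(key, key)
        if mapped not in seen:
            seen.add(mapped)
            result.append(mapped)
    return result


if __name__ == "__main__":
    # "Cat" and "kitten" both collapse to the preferred term "feline".
    print(reconcile(["Cat", "kitten", "veterinarian"]))
```

Keeping the mapping in a plain dictionary makes the preferred term the single value stored and searched on, whichever synonym appeared in the document.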
Semantic web technology is at the foundation of the Taxonomy Generator. Intermediate processing data is captured and stored as RDF triples, though users can export to SharePoint, CSV or whatever formats they prefer. “As an end result you have a customized taxonomy based on your content,” Riley says.
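As a rough sketch of what parent and child terms look like as triples, and how they might flatten to CSV, consider the following. The article does not say which vocabulary Pingar uses; the SKOS `skos:broader` predicate, the `ex:` prefix, the sample terms and the `to_csv` helper are all assumptions chosen for illustration.

```python
import csv
import io

# Each triple is (subject, predicate, object).
# "skos:broader" points from a child term to its parent term.
# The terms and the choice of SKOS are assumptions for this sketch.
TRIPLES = [
    ("ex:cat", "skos:broader", "ex:feline"),
    ("ex:lion", "skos:broader", "ex:feline"),
    ("ex:feline", "skos:broader", "ex:mammal"),
    ("ex:cat", "skos:prefLabel", "cat"),
]


def to_csv(triples):
    """Export the child/parent relationships as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["child", "parent"])
    for subject, predicate, obj in triples:
        # Only hierarchy triples belong in the child/parent export.
        if predicate == "skos:broader":
            writer.writerow([subject, obj])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(TRIPLES))
```

The triple form keeps every relationship as an independent statement, which is why the same data can be projected into a flat two-column file or any other target format.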
To go beyond the generated taxonomy and provide access to external structured knowledge, the solution supports connections into the Linked Data cloud (Freebase and DBpedia, for example). IP and patented technology around disambiguation also feature in the Taxonomy Generator, so that irrelevant extracted concepts can be discarded (Apple Corps when Apple Inc. is the relevant result, for example).
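To make the idea of disambiguation concrete, here is a deliberately simple word-overlap heuristic. It is not Pingar's patented method, which the article does not describe; the `CANDIDATES` cue words and the `disambiguate` function are hypothetical and exist only to show the kind of decision being made.

```python
import re

# Hypothetical cue words for each candidate sense of "Apple".
CANDIDATES = {
    "Apple Inc.": {"iphone", "mac", "software", "computer", "cupertino"},
    "Apple Corps": {"beatles", "records", "music", "label", "album"},
}


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose cue words overlap most with the context.

    Returns None when no candidate shares any word with the context,
    so that an irrelevant concept can be discarded rather than guessed.
    On a tie, the candidate listed first wins.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {name: len(words & cues) for name, cues in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("Apple released a new iPhone and Mac software update."))
    print(disambiguate("Apple reissued the Beatles album on its own label."))
```

A production system would draw on far richer signals, such as the linked data sources mentioned above, but the shape of the problem is the same: score each candidate against surrounding context and drop the ones that do not fit.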
“Sometimes we call what we do document understanding or enriching unstructured content,” says Riley. “Basically, it’s saying let’s get you organized and automate your organization, your compliance, and more.”