Guest post by Bryan Bell, VP of Enterprise Solutions, Expert System
Classification systems give structure to massive volumes of information, with the overriding goal of increasing discoverability. Organizations are working to manage large sets of data by classifying it into hierarchical trees (or taxonomies) based on the commonality of the content. You could say that classification is much like a multi-level sort, grouping similar information across multiple category classes.
Classification systems make it easier to understand and process large volumes of data by assigning sets of rules to a node within a classification tree. Various classification methods are used to apply knowledge to the nodes via a set of specific rules. The challenge is building and organizing a system in a logical order that covers a multitude of user perspectives: building the proper categories, assigning the proper classification to those categories, and describing what belongs in each category.
The development of classification systems and the management of data has quickly become a science. Generally speaking, a classification system will contain several parts: 1) The collection itself, 2) A classification hierarchy (tree) that categorizes documents by topic, 3) Sample documents describing the type of content to be classified within each category/node of the hierarchy and 4) An information platform that drives collection of content from the appropriate data sources and then places the content in the correct category.
What is the right method to implement an automatic categorizer?
A variety of technologies have been developed in an attempt to automate the process of classification, but essentially just three main approaches have been used over the past decade: the “bag of keywords” approach, statistical systems and rules-based systems. How do you know which approach is best for you? Wading through the marketing clutter can be difficult, but it’s important to choose the best solution based on facts (and not claims):
1. Manual approach. The “bag of keywords” approach requires compiling a list of “key terms” that describe the type of content in question, and using them to develop a metadata list to classify content to a specific node. One problem with this approach is that identifying and organizing a list of terms (preferred, alternate, etc.) is quite labor intensive, is not scalable, and in the end doesn’t address the ambiguity in language. In addition, such a system must be continuously updated because the content being gathered is ever changing. This is the oldest method, and time after time it has proven to be inefficient, inaccurate and simply not scalable.
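As a rough illustration, the keyword approach can be sketched in a few lines of Python. The category nodes and term lists below are invented for illustration; a real deployment would maintain far larger, hand-curated lists per node:

```python
# A minimal sketch of the "bag of keywords" approach: each category node
# is described by a hand-compiled set of key terms, and a document is
# assigned to the node whose terms it matches most often. Categories and
# term lists here are hypothetical.

KEYWORD_LISTS = {
    "finance": {"invoice", "payment", "budget", "revenue"},
    "hr": {"hiring", "salary", "benefits", "onboarding"},
}

def classify_by_keywords(text):
    tokens = set(text.lower().split())
    # Count how many of each node's key terms appear in the document.
    scores = {node: len(terms & tokens) for node, terms in KEYWORD_LISTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no match -> unclassified

print(classify_by_keywords("Please approve the invoice for the Q3 budget"))
# -> finance
```

Note that the sketch already exposes the weaknesses described above: “bill” or “remittance” would go unclassified until someone manually extends the term list, and an ambiguous word counts toward a node regardless of its meaning in context.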
2. Statistical approach. Statistical systems use a “training set” of documents that discuss the same topic, and apply algorithms (Bayesian, LSA or many others) to extract sets of key elements from the documents and build implicit rules that are then used to classify content. These systems have no understanding of words, nor can the system administrator pinpoint why specific terms are selected by the algorithm or why they are weighted as they are. In the event the classification is incorrect, there is no accurate way to modify a rule for better results; the only option is to select new training documents and start again. As content changes, new rules must be created to keep up with the content being ingested. Finally, let’s not forget that in reality most organizations do not have a training set of documents for each node of the category tree, which inevitably causes accuracy and scalability issues. This approach appeals to many because it seems fully automatic (although training requires a significant amount of time and manual work) and promises savings in time and manpower. In reality, the idea that a computer system can magically categorize content has created exaggerated expectations that have proven not only unrealistic and unreliable, but also detrimental to the industry.
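To make the training-set idea concrete, here is a toy multinomial Naive Bayes classifier in pure Python. The tiny training set and labels are hypothetical, and a real system would need far more labeled data per node; note also how the learned “rules” live only in the word counts, which is exactly why an administrator cannot inspect or adjust them directly:

```python
# A toy sketch of the statistical approach: multinomial Naive Bayes
# trained on a small hand-made "training set". Documents and labels
# are invented for illustration.
import math
from collections import Counter, defaultdict

TRAINING_SET = [
    ("the invoice payment is due", "finance"),
    ("quarterly budget and revenue report", "finance"),
    ("new hire onboarding and benefits", "hr"),
    ("salary review for the hiring team", "hr"),
]

def train(docs):
    word_counts = defaultdict(Counter)   # per-label word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    tokens = text.lower().split()
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING_SET)
print(classify("budget and invoice", *model))
# -> finance
```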
3. Rules-based approach. In this approach, rules are written to capture the elements of a document that assign it to a category; rules can be written manually or generated by an automatic analysis and only then validated manually (a time savings of up to 90%). Rules are deterministic (you know why a document is assigned to a category), flexible, powerful (much more so than a bag of keywords) and easy to express. The best option is to use a true semantic technology to implement semantic rules (rules can be written at the keyword or linguistic level), which makes it possible to leverage many elements to work faster and obtain better results:
A. A semantic network provides a true representation of the language and how meaningful words are used in the language in their proper context.
B. With semantics, rules are simpler to write and much more powerful, making it possible to work with a higher level of abstraction.
C. Once rules are written, the classification system provides superior precision and recall because semantic technology allows words to be understood in their proper context.
D. Once the system is deployed, documents that do not “fit” into a specific category are identified and automatically separated, and the system administrator is able to fully understand why it was not classified. The administrator can then make an informed decision on whether to modify an existing rule for the future, or create a new class for content that has not been previously identified.
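The deterministic quality described above, including the handling of documents that do not “fit” (point D), can be sketched as follows. The rules, categories and rule descriptions are hypothetical; a semantic platform would express rules at the linguistic level rather than as plain predicates:

```python
# A minimal sketch of a rules-based categorizer. Each rule is an explicit,
# inspectable predicate attached to a category node, so an administrator
# can see exactly why a document was (or was not) classified.

RULES = [
    # (category, human-readable rule description, predicate)
    ("finance", "mentions 'invoice' and 'payment'",
     lambda t: "invoice" in t and "payment" in t),
    ("hr", "mentions 'salary' or 'benefits'",
     lambda t: "salary" in t or "benefits" in t),
]

def classify_with_rules(text):
    lowered = text.lower()
    for category, description, predicate in RULES:
        if predicate(lowered):
            return category, description       # deterministic: we know why
    return None, "no rule matched"             # routed to the review queue

category, reason = classify_with_rules("Invoice attached; payment due Friday")
print(category, "-", reason)
# -> finance - mentions 'invoice' and 'payment'
```

When no rule fires, the document lands in a review queue with the reason attached, which is what lets the administrator decide between refining an existing rule and creating a new category.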
Why Semantic Intelligence is the best option.
Semantic technology is a way of processing and interpreting content that relies on a variety of linguistic techniques/processing, including text mining, entity extraction, concept analysis, natural language processing, categorization, normalizing, federating and sentiment analysis. Semantic technology allows for the automatic comprehension of words, sentences, paragraphs, and entire documents, and is able to understand the meanings of words expressed in their proper context, no matter the number (singular or plural), gender, verb tense or mode (indicative or imperative).
As opposed to keyword and statistical technologies that process content as data, semantic technology is based on not just data, but the relationships between and the meaning of the data. This ability to understand words in context is what makes automatic classification possible, and what allows it to not only manage the chaos of our data, but optimize it for even further analysis and intelligence.
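As a closing illustration of what “understanding words in context” means, here is a toy disambiguation sketch. The miniature “semantic network” is a hypothetical stand-in for a real one, which would model many thousands of concepts and relationships:

```python
# A toy illustration of context-driven meaning: the same surface word is
# mapped to different concepts depending on the surrounding words. The
# tiny "semantic network" below is a hypothetical stand-in for a real one.

SEMANTIC_NETWORK = {
    "java": {
        "programming_language": {"code", "class", "compile", "software"},
        "island": {"indonesia", "coffee", "travel", "volcano"},
    },
}

def disambiguate(word, context):
    senses = SEMANTIC_NETWORK.get(word, {})
    context_tokens = set(context.lower().split())
    # Pick the sense whose related concepts overlap most with the context.
    return max(senses, key=lambda s: len(senses[s] & context_tokens),
               default=None)

print(disambiguate("java", "we compile the class into software"))
# -> programming_language
```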