Organizing Unstructured Content

September 19, 2011

By Jim Wessely, President

Advanced Document Sciences

In knowledge management programs for large enterprises and organizations, a lot of valuable knowledge gets explicitly captured from discussion threads, answered questions, research papers, saved articles, presentations, and a variety of other document forms and formats.  These documents will typically refer to more than one concept or topic and can hold numerous sub-topics.  These concepts will have specific or implied relationships with one another within an individual piece of knowledge content.  Other knowledge content may contain the same concepts, but with a different mix of associated topics.

Making valuable retained knowledge accessible to a community of individuals with similar needs and interests is often best accomplished at the concept or topic level.  In other words, let people find the information they need by what it is actually about.  This is inherently different from providing access to content via attributes such as which business unit it came from, who created the document, the creation date, whether it is a procedure or a joint study, or numerous other types of attribute data that don’t tell the knowledge seeker what a document is about.  While these attributes can be very helpful, identifying the concepts and topics in knowledge content is very often the key to providing knowledge-based access to an organization’s intellectual and knowledge capital.

Modeling concepts and topics

A taxonomy or ontology for knowledge access should represent all of the significant concepts and topics in a given domain of interest, along with their relationships.  As it turns out, hierarchical structures in the branches of a taxonomy or the classes of an ontology are capable of doing this pretty well.  If the structure is well designed, the context that surrounds these concepts and topics can be derived from the hierarchy, too.  Properly designed and constructed, a taxonomy or ontology for knowledge access applications lets users navigate conceptual and topical relationships rapidly and intuitively, in a way that not only provides context but also supports discovery and new insights around these concepts and topics.  But the relationships between concepts are often quite complex, so crafting a good representation of these relationships in a taxonomy or ontology can be challenging.
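To make the idea of deriving context from a hierarchy a little more concrete, here is a minimal sketch in Python.  The `Topic` class, its method names, and the sample cancer topics are assumptions made up for illustration; they are not taken from any particular taxonomy tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Topic:
    """One node in a hypothetical topic hierarchy (illustrative only)."""

    name: str
    parent: Optional[Topic] = field(default=None, repr=False)
    children: list[Topic] = field(default_factory=list)

    def add(self, name: str) -> Topic:
        """Create a child topic under this one and return it."""
        child = Topic(name, parent=self)
        self.children.append(child)
        return child

    def context(self) -> list[str]:
        """Return the topic names from the root down to this topic."""
        path = []
        node: Optional[Topic] = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


# Hypothetical branch of a cancer taxonomy.
cancer = Topic("Cancer")
treatment = cancer.add("Treatment")
side_effects = treatment.add("Side effects")

print(" > ".join(side_effects.context()))
```

In this sketch the context of “Side effects” is simply its path through the hierarchy, which tells a reader that the topic concerns side effects of cancer treatment rather than side effects in general.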

A diagram representing these complex relationships might help to illustrate this challenge.  In the illustration below let’s say that different concepts are identified by different colors.  The same or very similar concepts have the same color.  Arrows represent relationships between the different concepts.  In this diagram there are three knowledge content “objects” (documents, discussion threads, etc.).  As we can see, each of these contains information about a common major concept.  These knowledge objects, however, will commonly also contain other significant or lesser concepts that may or may not have relationships to one another.

In this simple diagram it is easy to see that there will almost always be many different concepts with a very complex set of relationships when the number of content objects grows to a large volume.  Most large companies or organizations will have hundreds of thousands or even many millions of knowledge content objects in their share drives, knowledge or content management repositories, and squirreled away in individual hard drives.
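The same picture can be sketched as data.  In the hypothetical Python example below, each content object is reduced to the set of concepts it contains; the object names and concept labels are invented for illustration.

```python
from collections import defaultdict

# Three hypothetical knowledge content objects and the concepts in each.
content_objects = {
    "doc_a": {"cancer", "diagnosis", "treatment"},
    "doc_b": {"cancer", "treatment", "side effects"},
    "doc_c": {"cancer", "recurrence"},
}

# The major concept common to every object.
shared = set.intersection(*content_objects.values())

# Reverse index: for each concept, which objects mention it.
concept_index = defaultdict(set)
for obj_name, concepts in content_objects.items():
    for concept in concepts:
        concept_index[concept].add(obj_name)

print(shared)
print(sorted(concept_index["treatment"]))
```

Even with three objects there are already five distinct concepts, each with its own mix of neighbors, and that mix only gets richer as more objects are added.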

Organization of document content vs. knowledge content

The organizational approach in enterprise content and records management systems has traditionally relied upon document attribute data as the primary information source for content storage and retrieval.  Attributes such as business units, document types, expiration dates, functional areas, and similar things are common.  In the past few years content and records management applications have often used a “faceted” taxonomy design to define how the content and/or records are classified.  A facet is simply a set of terms that describe the possible values of a specific attribute type that can be associated with individual pieces of content.  Each facet often has a hierarchical structure that is usually limited to around two or three levels of depth.

As an example of this faceted taxonomy approach, let’s say that a content taxonomy had facets for content attribute information such as the products and services that the document is associated with, a business unit, and perhaps a document type (e.g., a contract, a form, a report, a product specification, etc.).  Selection of a value from each of these facets would provide a pretty distinct classification for the content object.  If optional attributes such as a security type or retention period are added, there will be a lot of information available about the content object with just a few user-selected values.  In this way a business unit manager, for example, could say “give me all of the reports for the second quarter from our Atlanta office” and the documents would be pretty easy to retrieve.
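The manager’s request can be sketched as a simple filter over facet values.  The following Python example is only an illustration: the facet names (`doc_type`, `business_unit`, `quarter`), the sample documents, and the `select_by_facets` helper are all assumptions, not part of any real content management product.

```python
# Hypothetical content objects, each tagged with one value per facet.
documents = [
    {"title": "Q2 sales report", "doc_type": "report",
     "business_unit": "Atlanta", "quarter": "Q2"},
    {"title": "Q2 supplier contract", "doc_type": "contract",
     "business_unit": "Atlanta", "quarter": "Q2"},
    {"title": "Q2 production report", "doc_type": "report",
     "business_unit": "Denver", "quarter": "Q2"},
    {"title": "Q1 sales report", "doc_type": "report",
     "business_unit": "Atlanta", "quarter": "Q1"},
]


def select_by_facets(docs, **facets):
    """Return the documents whose facet values match every requested value."""
    return [
        doc for doc in docs
        if all(doc.get(name) == value for name, value in facets.items())
    ]


# "Give me all of the reports for the second quarter from our Atlanta office."
matches = select_by_facets(
    documents, doc_type="report", business_unit="Atlanta", quarter="Q2"
)
print([doc["title"] for doc in matches])
```

Because each document carries exactly one value per facet, retrieval is a matter of matching a handful of attribute values.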

This kind of attribute assignment isn’t quite so easy to do with knowledge content, however.  A “report” in a knowledge application might be something like a thirty-page analysis of a geologic formation or arctic weather patterns, or it could be something like an introductory overview of a specific type of cancer.  These kinds of content objects, documents in these cases, are bound to cover multiple topic areas and contain multiple concepts.  The cancer overview, for example, might easily contain topics on detection, diagnosis, treatment alternatives, side effects of the treatment, chances of recurrence, and other areas.  This makes it far more difficult to pigeonhole the document with a single attribute value, let alone to identify multiple possible values.  This brings us back to the complex model of concepts and topics, relationships, and context.  The tidy facets from a content system may end up morphing into a hierarchy of concepts and topics for a knowledge system.

This comparison between content management and knowledge management is not intended to say that the task of properly designing a taxonomy for content management is simple or trivial.  It is certainly not trivial when a large amount of enterprise content or corporate records are involved.  The point is that taxonomy design for traditional enterprise content is a task with a different purpose than that of designing a structure that represents knowledge.  The bottom line is that these are two different applications and they should employ different considerations and design approaches.

Multiple Categories

Since knowledge content often contains multiple topics, it seems clear that a knowledge object might be categorized into more than one topic.  Let’s look back to the cancer overview document mentioned above.  It would seem likely that a taxonomy or ontology of cancer topics would cover topics on diagnosis, treatment, side effects of treatment, and many similar topics.  The cancer overview document would match a number of these topics, so it probably should be associated with all of the appropriate topics.  This is where the “aboutness” of the content comes into play.  This can also represent one of the biggest differences between working with knowledge content and more traditional enterprise content such as, say, sales records or production reports.

As it turns out, people are pretty good at classifying documents with a single mix of metadata from multiple facets, although inconsistencies between one person and another are common.  People are often not very good at assigning multiple topic values to content, though.  Selecting and assigning multiple topics to a piece of content takes a lot of time and effort, so they often lose interest.  People also tend to think about content differently, so they will often categorize content differently based upon whichever aspect of a document is of the most interest to them.  In this situation, a well-configured auto-categorization application can do a better job than humans.  This statement is bound to get some push-back from people with indexing backgrounds, but experience indicates that people are not very good at categorizing large volumes of knowledge content into multiple categories in a manner that is both consistent and comprehensive, that is, one that covers all of the appropriate topics that might exist in the content.

Auto-categorization for large volumes of knowledge content

As just mentioned, large volumes of knowledge content are often well suited to auto-categorization.  This technique uses some form of analysis of the content itself and can be referred to as content-based categorization.  The tools and methods most commonly used for auto-categorization are text analytics or, if the content is imagery and contains little or no text, image analysis.  Most knowledge content contains text, so content-based categorization can often be used to associate this content automatically and effectively with a taxonomic structure or with classes in an ontology.  A well-designed and well-implemented auto-categorization application can provide a high level of quality and consistency in categorizing content that contains complex topical and contextual relationships.
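As a rough illustration of content-based categorization, the Python sketch below assigns a piece of text to every topic whose cue terms appear in it.  This is a deliberately simple keyword-matching stand-in, not how commercial text analytics tools work; the topic names, cue terms, `categorize` function, and `min_hits` threshold are all assumptions chosen for the cancer overview example used earlier.

```python
import re

# Hypothetical cue terms for a few topics in a cancer taxonomy.
TOPIC_TERMS = {
    "Detection": {"screening", "detection", "mammogram"},
    "Diagnosis": {"diagnosis", "biopsy"},
    "Treatment": {"treatment", "chemotherapy", "radiation", "surgery"},
    "Side effects": {"nausea", "fatigue"},
    "Recurrence": {"recurrence", "relapse"},
}


def categorize(text, topic_terms=TOPIC_TERMS, min_hits=1):
    """Return every topic with at least min_hits cue-term matches.

    Topics are ordered by number of matches, then alphabetically.
    """
    words = re.findall(r"[a-z]+", text.lower())
    hits_by_topic = {}
    for topic, terms in topic_terms.items():
        hits = sum(1 for word in words if word in terms)
        if hits >= min_hits:
            hits_by_topic[topic] = hits
    return sorted(hits_by_topic, key=lambda t: (-hits_by_topic[t], t))


overview = (
    "Early screening improves detection. After diagnosis by biopsy, "
    "treatment may include chemotherapy or radiation. "
    "Common problems are nausea and fatigue."
)
topics = categorize(overview)
print(topics)
```

The point of the sketch is that a single document lands in several topics at once, and that the same rules applied to every document give the same result every time, which is where automated approaches tend to have an edge over manual tagging at large volumes.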

The technology, methods, and techniques used in auto-categorization applications have matured to the point where they can be quite practical to use in a variety of implementation scenarios.  Auto-categorization can be expensive to implement, but it can often have a very high return on investment.  In situations where there are very large volumes of content, auto-categorization becomes an enabler where it is impractical or cost prohibitive for people to do the categorization.

Thinking about your content

If your company or organization is thinking about implementing a knowledge management program, it would be beneficial to take a little time to deeply examine the types of knowledge content you will end up working with.  A good knowledge library is a highly valuable tool to promote reuse of the knowledge capital that your people have worked to gain.  The more you know about your content, the better you will be able to design and implement the knowledge library.

About the author

Jim Wessely is president and co-founder of Advanced Document Sciences, a consulting firm with a primary focus upon enterprise information organization and access. He has worked with unstructured information technologies since 1985, and spent many years researching and designing application solutions using text analysis, text mining, and unstructured information technologies. This background led Mr. Wessely to taxonomy design and implementation through content analysis. Mr. Wessely previously worked for IBM Global Services, where he helped clients to develop strategies and solutions for enterprise portals, content management, taxonomies, text analysis, and text mining. Prior to his work with IBM, Mr. Wessely was the principal architect for numerous advanced computational solutions in DuPont’s Central Research & Development, where his primary interest was unstructured information technologies and global scale information portals. Mr. Wessely has been a frequent presenter at conferences in both the United States and Europe on diverse topics such as enterprise content strategies, information management, taxonomy, auto-categorization, advanced information access, and personalized information delivery.
