Entity Extraction is the process of automatically extracting document metadata from unstructured text documents. Extracting key entities such as person names, locations, dates, specialized terms and product terminology from free-form text can empower organizations not only to improve keyword search but also to open the door to semantic search, faceted search and document repurposing. This article defines the field of entity extraction, describes some of the technical challenges involved, and shows how RDF can be used to store document annotations. It then explains how new tools such as Apache UIMA are poised to make entity extraction much more cost-effective for an organization.
It is often stated that over 90% of the useful information in any organization is stored in unstructured text documents – not in structured relational databases. And with the proliferation of technologies like RDF, RSS and Atom we are seeing the floodgates of text information opening for many organizations. Yet most enterprise architects will privately admit they have no cohesive architecture for leveraging unstructured information to help business users find critical documents and make better business decisions. But a new generation of easy-to-use open-source linguistic pipeline tools from the Apache Software Foundation is allowing even small organizations with modest budgets to build highly modular text processing systems that can be set up and configured by non-programmers.
Going Beyond Simplistic Keyword Searches
In the past most unstructured information in any organization has been stored in unstructured documents in various filing systems. If you wanted to find a document you had to know where it was stored. Some organizations used tools such as Apache Lucene to extract text from these documents and index the words they contained. Lucene and similar tools allowed you to perform fast keyword searches, and sometimes they helped you find the documents you were looking for. But keyword search is problematic in many ways. If the keywords you use differ from the words used in a document, you may miss critical documents. Documents also did not consistently encode important concepts such as products, person names and dates.
Here are three examples of where simple keyword search falls short:
- A user does a keyword search for all documents authored by "Bob Smith" but misses many of the documents they were looking for because the name "Robert Smith" was used in the author field.
- A user does a search for all documents with Product Name X, but does not find many of the documents they need because Product X was actually called Product Y until three months ago.
- A user is attempting to find all documents written in a specific time window, but the documents do not use consistent date formats.
All these searches fail because the indexing process is based simply on the literal keywords in each document. The searches do not take into account the context of each entity or the information that can be inferred from that context. Keyword search systems usually provide no linguistically-aware tools to find precise names, products or dates within a document.
But in the last few years we have started to see a new generation of tools that enable semantic search. These tools are called entity extractors (the process is sometimes called named entity recognition or entity identification). Entity extractors allow you to add precise, type-specific annotations and metadata to any unstructured text document. This annotation is called "document enrichment" and it is a standard pattern that recurs in document processing. Unlike ordinary keyword indexers, entity extractors are "smart" about knowing the type of data they find in a document and adding intelligent metadata. Here is a high-level explanation of how entity extraction works. Assume you have a document with the following text:
In our meeting on June 8th, 2009 Bob Johnson discussed the product X failure estimates.
When an entity extractor scans this text it adds the following XML-like annotations to the text:
In our meeting on <entity date="2009-06-08">June 8th</entity>, 2009 <entity person="Robert Q. Johnson">Bob Johnson</entity> discussed the <entity product-id="123">product X</entity> failure estimates.
In the text above, entity elements wrap the three entities found in the text. After this process is done, a simple query (for example in XQuery) on the document will allow all dates, persons or products to be extracted and indexed in your search engine. The entity extraction process can also be customized to be aware of the individual objects in your organization. For example, the person entity extractor knows about first-name aliases: it knows that "Bob" is a common alias for "Robert," so it can check the employee directory and put the person's correct full name in the person attribute. The date extractor knows about the many different date formats that might appear in text and can use document context (such as the year the document was created) to infer the precise date referenced, even when it is not specifically provided. The product name extractor can look up the new name of a product, associate it with a product database and insert the correct product ID. Future keyword searches can then use the product history to find all documents containing any of a product's names.
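The alias resolution and date inference described above can be sketched in a few lines of Python. This is only an illustration: the lookup tables here are hypothetical stand-ins for a real employee directory and product database, and the regular expressions are trimmed to the example sentence.

```python
import re

# Hypothetical lookup tables standing in for an employee
# directory and a product database.
PERSON_ALIASES = {"Bob": "Robert"}
PRODUCT_IDS = {"product X": "123"}
MONTHS = {"June": "06"}

def annotate(text, doc_year="2009"):
    """Wrap dates, people and products in <entity> tags."""
    # Dates like "June 8th" -- infer the year from document context.
    def date_sub(m):
        month, day = m.group(1), int(m.group(2))
        iso = f"{doc_year}-{MONTHS[month]}-{day:02d}"
        return f'<entity date="{iso}">{m.group(0)}</entity>'
    text = re.sub(r"(June) (\d+)(?:st|nd|rd|th)?", date_sub, text)

    # People like "Bob Johnson" -- resolve the first-name alias.
    def person_sub(m):
        first = PERSON_ALIASES.get(m.group(1), m.group(1))
        return f'<entity person="{first} {m.group(2)}">{m.group(0)}</entity>'
    text = re.sub(r"\b(Bob|Robert) (Johnson)\b", person_sub, text)

    # Products -- attach the product ID from the lookup table.
    for name, pid in PRODUCT_IDS.items():
        text = text.replace(name, f'<entity product-id="{pid}">{name}</entity>')
    return text
```

A production extractor would of course query a real directory service and support far more general date patterns, but the shape of the logic is the same.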
In the past entity extraction was very difficult for most organizations to implement. Entity extraction software used highly complex linguistic grammar-based techniques as well as statistical models. These tools were complex to install, configure and customize for each project. Furthermore, each entity extractor was an island unto itself and could not leverage knowledge gained from other entity extractors. Entity extraction was notoriously slow, and entity extractors from one company would usually not work with those created by another organization. The typical entity extraction pipeline (see Figure 1) was very rigid and could not be modified by non-programmers.
Figure 1: Typical Entity Extraction Pipeline
Enter Apache UIMA and the CAS
In recent years this has all begun to change. A new standard called UIMA has started to gain momentum in the text processing community. UIMA (pronounced "you-ee-ma") stands for Unstructured Information Management Architecture. It is now an official Apache Incubator project. Just like Eclipse, UIMA has a "plugin" architecture that allows developers all over the world to create and extend components that are interoperable.
UIMA was originally developed within IBM’s research division to encourage linguistics researchers to share components. UIMA has an innovative approach of storing standardized, strongly-typed annotations directly in physical memory. This structure, called a CAS (Common Analysis Structure), allows users to plug new annotators directly into a high-performance linguistic pipeline without needing to know how to write Java code. IBM realized that by making UIMA tools into an open source product they would encourage the creation of both higher-value and interoperable components by linguistics researchers. Using a central CAS permits each team to reuse basic structures such as sentences, tokens or part-of-speech identifications.
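To make the CAS idea concrete, here is a deliberately simplified Python sketch of a shared annotation store. The real UIMA CAS is a strongly-typed Java/C++ structure with a full type system, so the names and shapes below are illustrative assumptions only.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A typed span over the document text, CAS-style."""
    type: str        # e.g. "Token", "Sentence", "Person"
    begin: int
    end: int
    features: dict = field(default_factory=dict)

@dataclass
class CAS:
    """Shared store: every annotator reads and writes the same structure."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type, begin, end, **features):
        self.annotations.append(Annotation(type, begin, end, features))

    def select(self, type):
        return [a for a in self.annotations if a.type == type]

def tokenizer(cas):
    """Upstream annotator: adds Token annotations others can reuse."""
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        cas.add("Token", begin, begin + len(word))
        pos = begin + len(word)

def person_annotator(cas):
    """Downstream annotator: reuses Tokens instead of re-tokenizing."""
    tokens = cas.select("Token")
    for first, last in zip(tokens, tokens[1:]):
        if cas.text[first.begin:first.end] == "Bob":
            cas.add("Person", first.begin, last.end, fullName="Robert")
```

Because both annotators operate on the same CAS, the person annotator reuses the tokenizer's output instead of re-splitting the text – exactly the kind of sharing the UIMA pipeline encourages.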
The CAS has become the "secret sauce" that allows a wide variety of developers to quickly build new entity extractors without having to know the details of text processing pipelines. With UIMA, if you can express your entities as a set of keywords or a set of regular expressions, you can usually write and test a new entity extractor in a day, not weeks or months. Figure 2 shows a sample screen image of Apache UIMA’s tool for creating a sample annotator that recognizes conference room numbers. This annotator was created by expressing conference room numbers as a set of regular expressions.
Figure 2: Apache UIMA Annotator for Finding Conference Room Numbers
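An annotator like the one in Figure 2 is essentially a list of regular expressions applied to the document text. Here is a minimal Python sketch; the room-number formats below are invented for illustration and are not UIMA's actual example patterns.

```python
import re

# Hypothetical conference-room formats, e.g. "HAW GN-K35" or "Room 2B-100".
ROOM_PATTERNS = [
    re.compile(r"\b[A-Z]{3} [A-Z]{2}-[A-Z]\d{2}\b"),
    re.compile(r"\bRoom \d[A-Z]-\d{3}\b"),
]

def find_rooms(text):
    """Return (begin, end, matched_text) for every room-number mention."""
    hits = []
    for pattern in ROOM_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group(0)))
    return sorted(hits)
```

In UIMA the equivalent annotator would record each match as a typed annotation in the CAS rather than returning a list, but the pattern-matching core is this small.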
What does all this have to do with the Semantic Web? In the example above I used a simple XML entity markup format. But unless you are blessed with the luxury of using well-documented document markup standards such as DocBook, TEI or DITA, you may not be given precise formats for how your consumers want to query your documents for these entities. This is where the semantic web and RDF annotations begin to be used. Instead of just publishing a simple Atom feed for your documents, you may also be able to publish an enriched Atom feed that uses RDF annotations to mark up the text. You can begin by using simple vocabularies like FOAF (Friend of a Friend) to annotate people’s names, or microformats to annotate dates. If you have product-specific or terminology-specific names you can use RDF to add your own annotations.
The key is that the semantic web is starting to provide a mechanism to mark up any HTML document with these annotations. Examples today include RDFa and microformats, but the mechanism is not limited to these formats.
Example: Business Card with VCARD Microformats
As an example, you can create a UIMA "business contact" annotator that performs the following translation:
Input: Business Card Text

Dan McCreary
Syntactica
Semantic Solutions Architect
7400 Metro Boulevard, Suite 350
Minneapolis, MN 55439
(952) 921-9368
Output: Business Card Microformat HTML Markup
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2006/03/hcard">
  ...
  <div class="vcard">
    <a class="fn">Dan McCreary</a> <!-- fn is a full name -->
    <div class="org">Syntactica</div>
    <div class="title">Semantic Solutions Architect</div>
    <div class="adr">
      <div class="street-address">7400 Metro Boulevard, Suite 350</div>
      <span class="locality">Minneapolis</span>,
      <span class="region">MN</span>
      <span class="postal-code">55439</span>
    </div>
    <div class="tel">(952) 921-9368</div>
  </div>
In this case the "business card" entity analyzer not only processes the block of text but also finds the city, state and ZIP code subfields within the address text.
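Once the microformat classes are in place, any HTML parser can recover the fields again downstream. Here is a sketch using Python's standard-library `html.parser`; the class list is trimmed to the fields used in the markup above.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect element text by hCard class name (fn, org, locality, ...)."""
    FIELDS = ("fn", "org", "title", "street-address",
              "locality", "region", "postal-code", "tel")

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "").strip()
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = HCardParser()
parser.feed('<div class="vcard"><a class="fn">Dan McCreary</a>'
            '<div class="org">Syntactica</div>'
            '<span class="locality">Minneapolis</span></div>')
```

After `feed`, `parser.fields` maps each hCard class to its text, so search indexing and vCard export can consume the same annotations the UIMA annotator produced.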
Entity Extraction and Faceted Search
Frequently a person searching for a document finds that the document displayed is only a stepping stone to their target document. If the document viewing tool can extract a list of images, people and terms from a large document, you can present a list of "facets" of that document in its margin. For example, Figure 3 shows the right margin of a historical document with references to the images, persons and abbreviations used in that document.
Figure 3: Faceted Search
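The facet lists themselves fall out of the annotations: group entity spans by type and deduplicate. A sketch over the XML-style entity markup used earlier in this article:

```python
import re
from collections import defaultdict

def facets(annotated_text):
    """Group annotated entities by type for a faceted-search sidebar."""
    groups = defaultdict(set)
    # Matches <entity TYPE="VALUE">...</entity> spans produced upstream.
    for m in re.finditer(r'<entity (\w[\w-]*)="([^"]*)">', annotated_text):
        groups[m.group(1)].add(m.group(2))
    return {k: sorted(v) for k, v in groups.items()}
```

Each key becomes a facet heading ("Persons", "Dates", ...) and each sorted value list becomes the clickable entries shown in the document margin.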
Faceted search is one example of how automatically extracted data can be repurposed to perform a wide variety of tasks. When you combine a robust library of Entity Extraction components with tools such as automatic document categorization you empower your users to get more of the results they desire and fewer false search hits.
Entity extraction is just one of many exciting new technologies that allow organizations to go beyond imprecise keyword-based search tools on the road to more robust semantic search. RDF structures can be used to annotate HTML documents without disrupting their display on web pages. Tools like Apache UIMA are making it easier to automatically add metadata to your documents. Organizations need to start planning now to take advantage of low-cost open source tools to analyze both internal documents and web-based RSS and Atom feeds.