A Data Extraction System for Unstructured Documents

Click here to learn more about George Roth.

Let’s assume that we have a system that extracts information from unstructured documents.

There are two types of unstructured documents:

Type A: Unstructured documents with known content (e.g. legal contracts, SEC filings, etc.). An essential property of these documents is that they can be classified based on some keywords or keyword sets. The second property is that we know what information they contain; for example, a mortgage contract always has a start date, a mortgage rate, rate rules, a lender name, a loan type, etc.
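As a rough illustration of keyword-based classification for Type A documents, here is a minimal Python sketch. The keyword sets, the `classify_by_keywords` name and the `min_hits` threshold are all assumptions made for this example; they do not come from any particular system.

```python
import re

# Hypothetical keyword sets; a real system would curate these per document type.
KEYWORD_SETS = {
    "mortgage contract": {"mortgage", "lender", "loan"},
    "electricity contract": {"electricity", "kwh", "tariff"},
}

def classify_by_keywords(text, keyword_sets=KEYWORD_SETS, min_hits=2):
    """Return the document type whose keyword set matches best, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_type, best_hits = None, 0
    for doc_type, keywords in keyword_sets.items():
        hits = len(words & keywords)  # number of distinct keywords found
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    # Too few matching keywords means we cannot classify the document.
    return best_type if best_hits >= min_hits else None
```

A document that matches too few keywords comes back as `None`, which is the situation described next for Type B.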

Type B: The second type of unstructured document cannot be classified based on keywords and has unknown content. For example, a newspaper article has no fixed type and can be about anything: a war, a movie depicting a war, an environmental issue, etc. Type B documents require semantic classification if they do not have associated metadata.

One of the problems in processing Type A unstructured documents is how to measure precision. Let’s assume that we want to extract a set of fields from unstructured contracts. We can identify the contracts because they have the word “Contract” in the title of each document. But we can have multiple contract types in our corpus of documents: mortgage contracts, supplier contracts, support contracts, electricity contracts, etc.

We can define a list of fields of interest that is the union of the specific fields of all the contract types. We know that a mortgage contract contains different information than an electricity contract. But for the sake of simplicity, we decide to define a single taxonomy (the list of fields that the system will extract), knowing that not all the fields will be present in all contracts. We use our system to extract this data; it can be a machine-learning-based system or some other type. We then want to know how precise the extraction is. Which measures do we need to use?

We need to be able to measure the following four quantities in our system (true and false positives, true and false negatives):

A true positive (TP) is a value that was extracted by our system and was confirmed by the QA as correct.

False positives (FP) are values extracted by the system but corrected after QA.

True negatives (TN) are values that were not found by our system, where the QA confirms that the value for that specific field in the taxonomy is not present in the document.

False negatives (FN) are values that our system did not find, but that the QA determined are present in the document.
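The four definitions above can be sketched as a small Python function that labels one field of one document. The function name and the convention of using `None` for "no value" are assumptions made for this example. A value extracted for a field that the QA says is absent is counted here as a false positive, since the QA has to correct it.

```python
def classify_outcome(extracted, qa_value):
    """Label one taxonomy field of one document as TP, FP, TN or FN.

    extracted: value returned by the system, or None if it found nothing.
    qa_value:  value confirmed by the QA, or None if the QA says the
               field is not present in the document.
    """
    if extracted is None:
        # The system reported the field as missing.
        return "TN" if qa_value is None else "FN"
    # The system extracted something; the QA either confirms or corrects it.
    return "TP" if extracted == qa_value else "FP"
```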

The following performance indicators can be used (we call this the PSA set, for Precision, Sensitivity and Accuracy):

Precision
The precision of the data extraction tells us what fraction of the extracted values are correct.

The correct values are the TP, while the total values are TP + FP (correct and incorrect).
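Written as a formula, using the counts defined above:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
```

With made-up counts for illustration, if the QA confirms 80 extracted values and corrects 20, the precision is 80 / (80 + 20) = 0.8.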

Sensitivity
The sensitivity tells us what fraction of the values that could have been extracted were retrieved correctly.

The correct values are the TP, while the total values in the document are TP + FN. As defined above, FN are the values that the system reported as missing but the QA found in the document.
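Written as a formula, using the counts defined above:

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN}
\]
```

With made-up counts for illustration, if the system extracts 80 values correctly and misses 10 that the QA finds in the documents, the sensitivity is 80 / (80 + 10), or about 0.89.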

Accuracy
Precision and Sensitivity do not take into account the values that are really missing and that the system correctly reports as missing (the TN). Accuracy is the measure that tells us how correctly the system identifies ALL values, both existing and missing.
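Accuracy is commonly computed as (TP + TN) / (TP + FP + TN + FN). The three indicators can be sketched in Python as follows; the function name `psa` and the counts in the usage lines are made up for this example, and returning `None` for an empty denominator is just one possible convention.

```python
def psa(tp, fp, tn, fn):
    """Return (precision, sensitivity, accuracy) from the four counts.

    A ratio whose denominator is zero is returned as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return precision, sensitivity, accuracy

# Made-up counts, for illustration only.
precision, sensitivity, accuracy = psa(tp=80, fp=20, tn=50, fn=10)
```

Note that accuracy is the only one of the three that depends on TN, so it cannot be computed unless the true negatives can be counted reliably.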

The problem is that, in our example, choosing a single taxonomy for all the contract types means we cannot reliably determine the True Negatives and False Negatives, because we don’t have the information about which fields are expected for each document type. A field that belongs only to electricity contracts will be reported as missing for every mortgage contract, even though it was never expected there. That is why we need to have specific field lists for each document type and to be able to identify each document sub-type correctly. In our example, this means that we need to have a Mortgage Taxonomy, a different Electricity Taxonomy, a Supplier Taxonomy, etc.
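A minimal Python sketch of this idea follows. The taxonomies, field names and sample values are hypothetical and only illustrate the principle: once the document sub-type is known, the outcomes are counted only over the fields expected for that sub-type.

```python
# Hypothetical per-type taxonomies; the field names are illustrative only.
TAXONOMIES = {
    "mortgage": ["start_date", "mortgage_rate", "lender_name", "loan_type"],
    "electricity": ["start_date", "supplier_name", "price_per_kwh"],
}

def count_outcomes(doc_type, extracted, qa_values):
    """Count TP/FP/TN/FN over the fields expected for this document type only."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for field in TAXONOMIES[doc_type]:
        system_value = extracted.get(field)  # None if the system found nothing
        qa_value = qa_values.get(field)      # None if the QA says it is absent
        if system_value is None:
            counts["TN" if qa_value is None else "FN"] += 1
        else:
            counts["TP" if system_value == qa_value else "FP"] += 1
    return counts

# Made-up sample document, for illustration only.
extracted = {"start_date": "2020-01-01", "mortgage_rate": "4.5%"}
qa_values = {"start_date": "2020-01-01", "mortgage_rate": "4.2%",
             "lender_name": "Acme Bank"}
counts = count_outcomes("mortgage", extracted, qa_values)
```

In this sample, electricity-only fields such as `price_per_kwh` are never evaluated for the mortgage contract, so they cannot inflate the true negative count.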
