Automatically Structuring Unstructured Corporate Websites for Producing a Company Search Engine

January 21, 2009

Executive Summary

We have used a sophisticated array of AI/machine learning systems, in combination with statistical methods, background knowledge and expert-defined rules engines, to create, entirely automatically, a structured database of high-quality information. The example we have produced contains structured company records and fields for over 2 million IT and telecoms companies, using data taken from their websites.

The system has enabled us to:

  • crawl the web (we have so far completed approximately 100,000,000 .COM, .NET, .US and .BIZ sites) and identify which of these are company websites. For the first pass we focussed on IT companies, but we have now moved on to other types, e.g. general industry categories
  • identify and categorize approximately 2-3m IT companies
  • identify, extract, derive and compile a range of fields for each company
  • apply a combination of techniques to boost data quality and accuracy
  • store and query the results in a structured database.

The precision and recall of the extracted data are high, with a number of fields reaching between 97% and 99% accuracy. This is due to a combination of factors, including a multi-engine architecture that combines different AI systems and statistical approaches, a series of results-booster algorithms, and an iterative, machine-learning-based data quality boosting process.

We believe that this type of approach to structuring unstructured content will greatly assist in building a semantic web, i.e. a web of data where we know with very high accuracy what a piece of information is, not just what it says.

Processing Environment

Processing times using pre-production scale systems are as follows:

Hardware/Internet – 5 calculation servers (8 cores and 16GB memory per server), 100 Mbit internet channel

Scope – 100m URLs processed and 2-3m IT companies extracted in approximately 4 days

Results – Structured database created, 20+ fields extracted/derived/compiled per company, and data quality refined and boosted to 90%+ accuracy rates. Fields include company name, company description, products, product descriptions, addresses, aliases, phone numbers, executives, etc.

Thus, using the above process on totally unstructured information, we can create, in a matter of days, a database with company information that would take a team of 2,000 data entry personnel one year to create.
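To put these figures in perspective, the following back-of-the-envelope calculation derives the implied throughput from the numbers quoted above. It is a rough sketch only: it takes the lower bound of 2 million companies, and the figure of 250 working days per year is our assumption for illustration, not a measured value.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
URLS_PROCESSED = 100_000_000   # "100m URLs"
DAYS = 4                       # "approximately 4 days"
SERVERS = 5
CORES_PER_SERVER = 8
COMPANIES = 2_000_000          # lower bound of "2-3m IT companies"
DATA_ENTRY_STAFF = 2_000
WORKING_DAYS_PER_YEAR = 250    # assumption for illustration, not from the source

seconds = DAYS * 24 * 60 * 60
urls_per_second = URLS_PROCESSED / seconds
urls_per_core_per_second = urls_per_second / (SERVERS * CORES_PER_SERVER)
records_per_person_per_day = COMPANIES / DATA_ENTRY_STAFF / WORKING_DAYS_PER_YEAR

print(f"{urls_per_second:.0f} URLs/s overall")                 # 289 URLs/s overall
print(f"{urls_per_core_per_second:.1f} URLs/s per core")       # 7.2 URLs/s per core
print(f"{records_per_person_per_day:.0f} records per person per working day")  # 4
```

Under these assumptions, each data entry person would need to complete roughly four full company records, of 20+ fields each, every working day for a year.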

Structuring the unstructured Web is a key step towards achieving the Semantic Web

The Web largely consists of untagged, unstructured, user-generated content. This is true regardless of whether a website is a blog, a newspaper site, a company website, an online retail site or any other kind of site. The lack of structure is a problem for queries that require some understanding of what the data is, not just what it says.

Similarly, the usefulness of social networks such as MySpace or Facebook is linked to users being required to enter information in a structured or semi-structured way. For example, when a user enters their name in Facebook during registration, then any other user or search engine can easily understand and use the fact that the information entered in this field is in fact their name.

This kind of re-use of information is not possible for the vast majority of the Web, as information is predominantly unstructured, i.e. we know what it says, but we don’t know what it is.

Semantic Web is a term that applies to information on the Web that has been annotated in a way that identifies what the information is. In this ideal state, most of the information on the Web would be as accessible and usable as information in social networks.

Whilst there are many ideas and initiatives in progress, the main challenge for Semantic Web initiatives lies in converting the vast amount of unstructured information that is present on the Web to a structured or semi-structured state.

Our company, aiHit, has developed technology that enables information found on unstructured websites to be converted to structured data and stored in a structured database. It is able to do this rapidly, with minimal human intervention, and to a level of accuracy that, in our tests, exceeds that of human input.

Creating a company search engine by structuring unstructured information

These days, it is very unusual for an IT company not to have a website. The vast majority of IT and telecommunications companies, if not close to 100%, have websites.

Despite this fact, with current query technology it is not easy to answer questions such as:

  • Who are all the content delivery network companies?
  • Who are all the fabless semiconductor companies?
  • Who sells Sage Accounting implementation services?
  • Where are they?
  • Who invests in companies like these?

All this data is present on the web but is unstructured. We don’t know which of the websites are corporate websites, what these companies are called, what they do, which products, services and solutions they offer, where they are located and so forth. We would be able to answer these questions if the relevant information were stored on websites in a structured way.
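To illustrate why structure matters, here is a minimal sketch of how such a question becomes a simple query once the data sits in structured fields. The table layout, column names and company records are hypothetical placeholders invented for this example; they do not reflect our actual schema or data.

```python
import sqlite3

# Hypothetical three-column table; a real company record would hold many more fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (name TEXT, category TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?)",
    [
        # Made-up placeholder records, not real companies.
        ("Example CDN Ltd", "content delivery network", "London"),
        ("Sample Chips Inc", "fabless semiconductor", "San Jose"),
        ("Demo Edge Corp", "content delivery network", "Austin"),
    ],
)


def companies_in_category(category):
    """Return (name, city) pairs for every company in the given category."""
    rows = conn.execute(
        "SELECT name, city FROM company WHERE category = ? ORDER BY name",
        (category,),
    )
    return rows.fetchall()


print(companies_in_category("content delivery network"))
# [('Demo Edge Corp', 'Austin'), ('Example CDN Ltd', 'London')]
```

The same question asked of unstructured pages would require a full-text search followed by manual reading, because nothing marks a given string as a company name, a category or a location.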

We have created technology that solves this problem. We have crawled all .COM domains and, using artificial intelligence and machine learning techniques, we have identified:

  • All corporate websites
  • Name of the company
  • Description of the company
  • Products, services, and solutions offered by the company
  • People’s names, titles, and biographies
  • Customers, partners and investors
  • Office locations
  • Presence of job offers

of over 2m IT and telecommunications companies. An example company profile in the company search engine looks as follows:

[Figure: example company profile]

It should be noted that all of the above information is produced entirely automatically, i.e. untouched by human hand. The information is generated by the AI engines directly from the unstructured company website. The example shown is a publicly listed company. However, over 98% of the companies in our database are privately held.

The information on companies is stored in a structured database and is therefore easily accessible. The beta system to examine and query this data is viewable at

Data quality of information is very high due to continuous machine learning

In order to reliably measure the quality of data produced by our systems, we have created a gold corpus of company data. The information across all fields is typed in by our data quality team and is continuously kept up to date. We use this gold standard both to train parts of the system and to compare the data automatically extracted from websites with the data typed in by our data quality team. This enables us to measure the precision and recall of our data. (Precision is the proportion of extracted information that is correct; recall is the proportion of the information present on websites that has been extracted by our AI engines.)
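As a minimal sketch of the measurement described above, precision and recall for a single field can be computed by comparing the set of extracted values with the gold-corpus values. The function name and the sample names are hypothetical, and exact string matching is a simplifying assumption; a production comparison would likely need normalization of spelling and formatting.

```python
def precision_recall(extracted, gold):
    """Compare extracted values with gold-corpus values using exact matching."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: four people listed on a site, three items extracted,
# one of which ("Contact Us") is not a person at all.
gold_people = {"Alice Smith", "Bob Jones", "Carol White", "Dan Brown"}
extracted_people = {"Alice Smith", "Bob Jones", "Contact Us"}

precision, recall = precision_recall(extracted_people, gold_people)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.50
```

Here two of the three extracted items are correct (precision of about 67%), while only two of the four people on the site were found (recall of 50%).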

In addition, the data set underlying the gold corpus is changed continuously to avoid “specialization” of the learning and measuring systems and the inherent errors that this would produce over time. Thus our data quality and learning systems are continually improving.

For example, the following graph shows how the precision of information in three fields that we extract has improved over time. The fields are the automatically generated name of the company, the generated company description and the information linked to people found on the corporate website (person name, job title, biography).

Precision of information – example fields

In the cases of the company name and company description, the initial results produced by our AI engines were around the 75%-80% mark. Over a five-month period from August 2008 to January 2009, the data quality in these two fields was significantly improved and is currently fluctuating between 97% and 99%. In the case of extracting information on people, our initial precision was significantly lower, at 5%. However, in a short period of one month we were able to improve the quality of extraction to 75%, and subsequent work has enabled us to improve precision to 99%.

The extent of recall of information has progressed similarly.

Recall of information – example fields

Over a five-month period, we have improved the systems to correctly identify the name and description for 97% of all captured companies. We have similarly improved the identification of people on company websites. Some 81% of all people listed on a website are currently found and extracted by our AI systems. There are two reasons why this figure is not in the high 90% range. First, we have not particularly focussed on the recall rate of this field, choosing instead to focus our resources on precision. Secondly, some websites use robots.txt instructions that ask robots not to crawl their people data.
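The effect of such robots.txt instructions can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the crawler name and the URLs below are hypothetical examples; this is an illustration of the mechanism, not a description of our crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all robots to stay out of the team pages.
robots_txt = """\
User-agent: *
Disallow: /team/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleCrawler" is a made-up user agent name.
print(parser.can_fetch("ExampleCrawler", "https://www.example.com/team/"))      # False
print(parser.can_fetch("ExampleCrawler", "https://www.example.com/products/"))  # True
```

A crawler that honours these instructions never sees the people pages of such a site, so the people listed there count against recall even though the extraction engines themselves made no error.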

Structuring unstructured Web content with high accuracy and high speed

Our database of companies currently holds information on over 2 million IT companies in some 20+ fields of information. This information is generated fully automatically by a small array of standard Intel processors. The information itself is produced at very high precision and with very high recall rates. All of the above is executed at very high speed.

To generate and keep up to date a comparable database using human data entry would require a team of at least 2,000 well-trained data entry personnel. The accuracy rates that can be achieved by this kind of personnel are in fact comparable to the data quality that can be achieved by our AI engines. Indeed, some of the problems in measuring data quality accurately and reliably are related to human entry errors, as opposed to issues with our AI engines.

We have trained our AI engines to recognize specific fields on corporate websites. We have also trained the engines to recognize corporate websites and to distinguish them from other kinds of websites. In principle, however, the approach that we have taken to identifying certain facts can be applied to many other cases of semantically annotating, and therefore structuring, unstructured content.

If computers can structure 2 million company websites in a matter of days, a job that would take 2,000 data entry staff a full year, then it becomes obvious how we may advance towards a semantic web. We believe that, using this approach, machines can semantically annotate a large segment of the World Wide Web in a short period of time.

Final Thought

The power of social networks is partially linked to the fact that information is structured and can be easily re-used and linked. Companies have already entered their information in a social network. This social network is the Web.

Using our approach, regardless of the structure of the content or the structure of the site, we can automatically create a quasi-social-network where companies are present, categorised and related by the fact that they are present on the Web.

Going forward, we believe that most, if not all, information on the web will be linked in such a way. So, in order to share information with your friends, customers or colleagues, you won’t have to enter it into a specific database. In fact, you will be able to enter it anywhere. At that point, we will have the real semantic web.
