Ontology-based mining of digital text – Internet monitoring for Investor Relations

The evolution of the Semantic Web brings new possibilities for handling information overload. This paper focuses on the motivation for integrating two applications and on the use of semantic content by information managers. Analysts and executives today need to understand the value of semantic technologies for business intelligence and decision support. A number of semantic technology categories are now mature, including automatic annotation, information extraction, text mining, and semantic navigation and search within content.

Semantic technologies make it possible to exploit unstructured information in order to increase relevant knowledge within the organization. The solution described here is based on prototype experience at 8iris and Ontos. 8iris specializes in IR solutions and Ontos in semantic technologies. The two companies agreed to integrate their solutions in order to provide more knowledge to the end user.

Challenges to Investor Relations

The investor community’s expectations of companies have increased due to the market situation. At annual or quarterly meetings, a company has to be ready to answer many questions about the market, the competition and trends. According to IR surveys, the strongest annual reports are those that cover the company’s forward-looking strategy and its view of its markets and business, together with the CEO’s insight into future trends and major new developments. For this purpose it is important to analyze all available sources and to extract and understand the information in the data provided. Most of it is in unstructured form, in web pages and text reports, which leads to high costs for manual reading, tagging and aggregation. The key challenge is therefore to establish a monitoring system that gathers competitive intelligence data. The output of the analysis is used for both internal and external purposes. Internally, competitive intelligence covers areas such as deals won or lost, new deals won by the competition, and informing management about market share. Externally, the audience is analysts and shareholders. They may well have talked to your competitors, so the IR team’s analysis should cover as much of your competitive landscape as possible. The knowledge-gathering questions should cover things like:

  • Is the industry stable?
  • Are your executives respected and how are they perceived in the news (positive/negative)?
  • How do market participants view your competitors?
  • What are the market trends?
  • Which companies interact with each other?
  • Is there turnover among key employees?

The above list of questions is just an extract and can be used to formulate the domain ontology (Pic.1).

Picture 1: CI Ontology


A user interested in answering the questions above would search the Internet, especially with search engines (Google, Yahoo, MSN etc.), buy analyst reports, and subscribe to RSS feeds and press releases. Besides the problem of judging the trustworthiness of the sources, the user has to invest a lot of time in formulating query terms and in understanding the content of each document in the search results. Most of the data is unstructured, and simple keyword-based search returns a lot of irrelevant information. A further problem of conventional search engines is the lack of semantic links between objects. A semantic link could relate a person to a company, e.g. Mr Smith has been appointed as CTO at a telecom company. Because the current Internet does not carry this semantic metadata, we cannot run precise queries against it, and keywords alone do not give the computer a reliable basis for aggregating information automatically. A possible solution lies in semantic technologies, which provide tools and methods for extracting information and thereby creating metadata that a computer can use.
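To make the idea of a semantic link concrete, the following minimal sketch represents the "Mr Smith" example as subject-predicate-object triples and answers a precise question that a keyword search could not. The class `Triple`, the predicate names `hasRole` and `isEmployeeOf`, the company name and the second person are illustrative assumptions, not part of the system described here.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One semantic link: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts, as a semantic annotator might extract them from news items.
FACTS = [
    Triple("Mr Smith", "hasRole", "CTO"),
    Triple("Mr Smith", "isEmployeeOf", "Telecom Co"),
    Triple("Ms Jones", "hasRole", "CFO"),
    Triple("Ms Jones", "isEmployeeOf", "Telecom Co"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return all triples that agree with every field that is not None."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def people_with_role_at(facts, company, role):
    """People who are employed by `company` and hold `role`."""
    employees = {t.subject for t in match(facts, predicate="isEmployeeOf", obj=company)}
    holders = {t.subject for t in match(facts, predicate="hasRole", obj=role)}
    return sorted(employees & holders)


if __name__ == "__main__":
    print(people_with_role_at(FACTS, "Telecom Co", "CTO"))
```

A keyword query for "CTO Telecom" would return every page that contains both words; the structured query returns only the person for whom both links hold.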

Semantic Technologies

The Internet (World Wide Web) has drastically changed the availability of electronic information. Its exponential growth makes it increasingly difficult to find, access, present and maintain information. Most pages use representations based on markup languages like HTML or SGML and rely on protocols that allow browsers to present information to human readers. The content itself, however, is mainly expressed in natural language. The serious problem is how to access the important content stored in natural-language digital text. A user working with classical tools like search engines (Google, Yahoo, MSN etc.) has trouble finding the right piece of information. The user gets lost in a huge amount of data that is often irrelevant or imprecise, and is pointed to many other pages through links. To evaluate the content, the user has to open the pages and read each article before extracting the knowledge he needs. This shows clearly how difficult it is to carry out the competitive intelligence monitoring described in the previous chapter. Semantic technologies today provide tools and methods to overcome such problems. For the purpose of Internet monitoring and competitive intelligence (CI), the following techniques have been applied:

  • Semantic annotation of digital text based on ontology-driven natural language processing (NLP)
  • Storage of meta-information in RDF format in a knowledge base
  • Web Services to search, navigate and analyze the semantic content

The semantic annotation process extracts the named entities and the semantic links or relations. The process is driven by an ontology that defines the objects and relations that can be recognized automatically. The process “understands” written text and extracts it as structured information (Pic.2). The extracted metadata is stored in the RDF store and can be used for various purposes. The benefit for the user is that he can pose precise queries to the RDF store to receive the information he needs. The next step was to combine the existing IR solution from 8iris with the semantic technology from Ontos in order to provide a complete IR dashboard, including information about the competition and the market. For this purpose, Web services have been defined and linked to the existing dashboard to enrich the information it presents.
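The role of the ontology in the annotation step can be illustrated with a deliberately simplified sketch. Here the ontology lists the entity classes and relations that may be recognized, and plain regular expressions stand in for the NLP component. This is an assumption made for illustration only: the actual extraction technology is not described at this level of detail, and the names `ONTOLOGY` and `annotate`, the trigger phrases and the sample entities are all hypothetical.

```python
import re

# Hypothetical mini-ontology. "entities" maps each class to known instances;
# "relations" maps each relation to (subject class, trigger phrases, object class).
ONTOLOGY = {
    "entities": {
        "Person": ["Mr Smith", "Ms Jones"],
        "Company": ["Acme Telecom", "Globex"],
        "Location": ["Zurich", "London"],
    },
    "relations": {
        "IsEmployeeOf": ("Person", r"has been appointed as \w+ at|joins", "Company"),
        "LocatedIn": ("Company", r"is based in|is headquartered in", "Location"),
    },
}


def annotate(text, ontology=ONTOLOGY):
    """Extract (subject, relation, object) triples that the ontology allows.

    Only entities and relations defined in the ontology can be recognized;
    everything else in the text is ignored.
    """
    triples = []
    for relation, (subj_cls, trigger, obj_cls) in ontology["relations"].items():
        subjects = "|".join(re.escape(e) for e in ontology["entities"][subj_cls])
        objects = "|".join(re.escape(e) for e in ontology["entities"][obj_cls])
        pattern = re.compile(rf"({subjects})\s+(?:{trigger})\s+({objects})")
        for m in pattern.finditer(text):
            triples.append((m.group(1), relation, m.group(2)))
    return triples


if __name__ == "__main__":
    news = ("Mr Smith has been appointed as CTO at Acme Telecom. "
            "Ms Jones joins Globex. Globex is based in London.")
    for triple in annotate(news):
        print(triple)
```

Extending the ontology with a new class or relation extends what the annotator can recognize without changing the extraction code, which is the point of an ontology-driven process.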

Picture 2: Ontology driven information extraction

Integrating semantic content with IR

The existing 8iris dashboard (Pic.3) was extended with semantic content that is automatically aggregated from sources defined as trustworthy. Based on various discussions, the domain ontology was developed first (Pic.1). The ontology is part of the information extraction process for natural language (Pic.4). The NLP system extracts the metadata, including named entities (such as person, company, date etc.) and the relevant relations and facts (such as LocatedIn, IsEmployeeOf, Sentiment etc.). A special feature of the final system is the merging of named entities. The extracted named entities and relations are written to an RDF store. An application server generates the different dashboard and analysis outputs for the IR user. The user can always access the original article on the web by following the respective hyperlink.
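How named entities are merged is not detailed here. As a rough sketch of the general idea, the example below treats surface forms as the same entity when they agree after lowercasing and dropping a trailing legal suffix. This rule, the suffix list and the function names `normalize` and `merge_entities` are assumptions for illustration, not the method used in the prototype.

```python
import re

# Assumed list of legal-form suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "ag", "ltd", "corp", "gmbh", "plc"}


def normalize(name):
    """Lowercase, drop punctuation and trailing legal suffixes."""
    tokens = re.findall(r"\w+", name.lower())
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def merge_entities(triples):
    """Rewrite triples so that variant names share one canonical form.

    The first surface form seen for a normalized key becomes canonical;
    triples that become identical after merging are kept only once.
    """
    canonical = {}

    def canon(name):
        return canonical.setdefault(normalize(name), name)

    merged, seen = [], set()
    for subj, pred, obj in triples:
        triple = (canon(subj), pred, canon(obj))
        if triple not in seen:
            seen.add(triple)
            merged.append(triple)
    return merged


if __name__ == "__main__":
    extracted = [
        ("Acme Telecom AG", "LocatedIn", "Zurich"),
        ("ACME Telecom", "LocatedIn", "Zurich"),
        ("Mr Smith", "IsEmployeeOf", "Acme Telecom, Inc."),
    ]
    for triple in merge_entities(extracted):
        print(triple)
```

Without such a step, facts about the same company would be scattered across several nodes in the RDF store and any count or chart built on them would be fragmented.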

Within the CI dashboard the IR user can select from a variety of reports and charts. For example, you can compare the visibility of your competitors in the online news (Pic.5). Another important aspect is monitoring how your company or management is perceived in the market by analyzing positive and negative statements. The sentiment analysis gives a further view of how your management is seen (Pic.6). The two applications are integrated via Web services and Web Widget technology.
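The kind of aggregation behind such charts can be sketched as follows, assuming each annotated article yields an (entity, sentiment label) pair. The data, the label set and the function names `visibility` and `sentiment_breakdown` are hypothetical; the actual dashboard computes its figures from the RDF store.

```python
from collections import Counter, defaultdict

# Hypothetical annotated mentions: one (entity, sentiment label) pair per article.
MENTIONS = [
    ("Acme Telecom", "positive"),
    ("Acme Telecom", "negative"),
    ("Acme Telecom", "positive"),
    ("Globex", "positive"),
    ("Globex", "neutral"),
]


def visibility(mentions):
    """Share of all mentions that refer to each entity."""
    counts = Counter(entity for entity, _ in mentions)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}


def sentiment_breakdown(mentions):
    """Number of mentions per sentiment label, grouped by entity."""
    by_entity = defaultdict(Counter)
    for entity, label in mentions:
        by_entity[entity][label] += 1
    return {entity: dict(labels) for entity, labels in by_entity.items()}


if __name__ == "__main__":
    print(visibility(MENTIONS))
    print(sentiment_breakdown(MENTIONS))
```

The first function corresponds to a market visibility comparison, the second to a per-company sentiment view.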

Picture 5: Market visibility

Picture 6: Sentiment analysis


The current version is not final and will be continuously extended with additional charts and new named entities and facts.


After implementing the first prototype, the following main benefits have been achieved:

  • Keyword-based search has been replaced by clearly defined objects with attributes and semantic relations;
  • Aggregation is automated, which has reduced manual work and therefore costs;
  • Data retrieval and the creation of different analyses are based on the RDF store and provide higher precision;
  • Documents from various content sources are connected on the basis of their content, that is, the semantics that have been automatically extracted;
  • Searching and browsing are driven by Web services with easy-to-use navigation cards or a dashboard, which also leads to greater end-user satisfaction.


The development of Ontos' solutions for the Semantic Web is based on text mining systems that process multilingual text collections within the context of domain models represented by ontologies. This paper discussed intelligent services suited to making the extracted information available to analysts and executives. Future development includes the extension of the domain ontology, new charts and the possibility of intelligent clustering. Also under discussion are topics such as XBRL, in order to mash up financial statements, and the automatic creation of summaries and digests.