Bringing Semantic Technologies to Enterprise Data

by Paul Miller

World Wide Web inventor Sir Tim Berners-Lee declared the Semantic Web ‘open for business’ in 2008, celebrating the ratification of the SPARQL query specification by the World Wide Web Consortium (W3C), the organization of which he is Director. “I think we’ve got all the pieces to be able to go ahead and do pretty much everything,” he stated in an interview. “You should be able to implement a huge amount of the dream, we should be able to get huge benefits from interoperability using what we’ve got. So, people are realizing it’s time to just go do it.”

Research into the Semantic Web has come a long way since it reached wider public attention with the publication of a visionary article in Scientific American back in 2001, and numerous academic events take place each year at which researchers from around the world continue to share their latest advances.
Consumer-facing sites such as Freebase, TripIt, True Knowledge, Twine and others have had some success in attracting attention and investment, and discussion of ‘Linked Data’ is now moving beyond enthusiastic experimentation toward a mode in which viable business models are beginning to emerge. The perception remains strong, however, that semantic technologies are of only limited utility in enriching an enterprise’s own interactions with the internal data upon which it increasingly depends.

Mastering Data, Semantically

Silver Creek Systems' VP of Marketing, Martin Boyd, is one of a growing band to disagree. The Colorado company offers solutions for Master Data Management (MDM), and deploys semantic technologies to reduce costs whilst significantly increasing speed and precision for customers in distribution, retail, manufacturing and healthcare. Boyd suggests that there is “a huge divide between people who care about semantic technology and people with cheque books,” and goes on to stress the need to pitch compelling solutions to business problems rather than technology for technology’s sake. He argues that “‘Semantic’ should be mentioned in the second sentence of the elevator pitch, not the first,” echoing sentiments expressed by others including Verizon [VZ] Chief Scientist Michael Brodie, Telefónica [TEF] Director of Research & Development Richard Benjamins and Vulcan Capital’s Director of Knowledge Systems, Mark Greaves.

MDM is of critical importance across a wide range of business sectors, from manufacturing and distribution right through to the point of sale, and on into after-sales and support. At every step along the way, there is a need to unambiguously and easily identify parts, products, processes, or people. As Wikipedia notes, MDM “comprises a set of processes and tools that consistently defines and manages the non-transactional data entities of an organization. MDM has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.”

Silver Creek concerns itself with product data, which Boyd asserts has traditionally been a harder problem than the management of customer profile data undertaken by competitors such as Informatica [INFA] and Firstlogic (a subsidiary of SAP BusinessObjects [SAP]). With customer data, well-understood probabilistic and pattern-matching techniques can be applied to check and repair names, addresses and related attributes with a high degree of precision. With product data, on the other hand, high volumes and low margins combine with poor standards and a lack of coherent structure to create a set of problems that more traditional approaches struggle to resolve in a timely and cost-effective manner. Boyd cites a 2007 survey of business and technology trends by Ventana Research which suggests “80% of companies are not confident in the quality of their product data,” and “73% find it ‘difficult’ or ‘impractical’ to standardize product data.” Gartner’s [IT] Andrew White agrees, arguing “product data is inherently variable, and its lack of structure is generally too much for traditional, pattern-based data quality approaches.”

There have been various efforts to address this problem through the development of standards. The United Nations Standard Products and Services Code (UNSPSC), for example, is an international effort to formalize aspects of product classification. Boyd suggests that, as with many standards, there is insufficient adoption of the generic UNSPSC. Instead, implementers tend to extend and modify those codes pertaining to their own areas of interest, increasing internal utility whilst complicating the flow of information to and from third parties. A large retailer with which Silver Creek is involved, for example, shows more than 65,000 unique items for sale via its online store, sourced from over 3,000 suppliers across 1,000 product categories. Growing support for another cross-industry initiative, the Global Data Synchronisation Network (GDSN), sees this retailer and others able to reliably exchange generic information on a product’s dimensions, weight and storage temperature: the information that makes sure distributors maximize container utilization, take appropriate regulatory precautions and arrive at a destination where there will be space to receive their load. Boyd suggests, however, that GDSN is of little utility in describing differentiation within product categories. Are the 15,000 handbags you know fit within your shipping container pink or black, and do they have zip or button fasteners? Do the cars coming off the transporter and into your showroom have air conditioning or satellite navigation, and how fuel efficient are they?

Leveraging semantic technologies within their DataLens System, Silver Creek assert that they can make sense of the real world data their customers need to integrate and manage, and deliver a solution that can be deployed and maintained by subject matter experts close to the products, rather than the more remote IT department. Differentially structured data from all points in the supply chain can be analyzed, and the meanings of codes, phrases and abbreviations either extracted from existing rules and terminologies (such as UNSPSC or GDSN) or inferred by the system by reference to the context in which it encounters the data. ‘PC’ might be deduced to mean ‘Personal Computer’ in a context filled with Megahertz, Gigabytes and Windows, ‘Police Constable’ in a law enforcement context, and ‘Polycoated’ in the context of sterile latex medical gloves. Once understood, terms can be harmonized and subsequently consistently expressed in the different forms appropriate to purchasing agents, shelf stackers and consumers.
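The ‘PC’ example can be made concrete with a small sketch. This is not the DataLens algorithm, which the article does not describe; it simply assumes a hand-built table of candidate expansions, each with a few cue words, and picks the expansion whose cues overlap most with the surrounding text. The `SENSES` table, the cue words and the `expand` function are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical sense inventory: each expansion of an abbreviation is paired
# with context words that tend to appear alongside it. These lists are
# illustrative assumptions, not data from any real product.
SENSES = {
    "PC": {
        "Personal Computer": {"megahertz", "gigabytes", "windows"},
        "Police Constable": {"police", "officer", "arrest", "station"},
        "Polycoated": {"sterile", "latex", "gloves", "medical"},
    }
}


def expand(abbreviation, context):
    """Return the expansion whose cue words overlap most with the context.

    Returns None when the abbreviation is unknown or no cue word appears.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    best, best_score = None, 0
    for expansion, cues in SENSES.get(abbreviation, {}).items():
        score = len(cues & words)  # number of cue words seen in the context
        if score > best_score:
            best, best_score = expansion, score
    return best


if __name__ == "__main__":
    print(expand("PC", "Desktop PC, 2 Gigabytes RAM, Windows"))
    print(expand("PC", "Sterile latex medical gloves, PC, box of 100"))
```

A production system would presumably learn or curate such cues from existing terminologies rather than hard-code them, but the principle of letting context choose the meaning is the same.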

A leading healthcare provider contracted Silver Creek to assist with managing information on some six million items, regularly drawn from across 1,700 hospitals. Silver Creek’s technology reputedly improved on the existing manual system by over 30%, reducing operating costs by $4-5 million per annum. A spokesperson for the company said “in one comparative test, we achieved a 63% match rate after 16 person days of manual effort. With the DataLens system, we achieved an 80% match in less than one hour.” In a second example, systems consolidation following the acquisition of a company with a $3 billion market capitalization by a competitor three times its size saw a 90% reduction in project cost and effort after Silver Creek became involved.

Semantics in Business Intelligence

Bradley Allen agrees that MDM is ripe for enrichment with semantic technologies, and is joined by Oracle’s [ORCL] Jeff Pollock in suggesting that Business Intelligence should be another obvious beneficiary. Allen notes, though, that the market for Business Intelligence solutions is mature and largely controlled by offerings such as BusinessObjects and Siebel, both now owned by giants of the enterprise computing space. Despite asserting clear benefits to a semantically enriched approach, he argues it is difficult to penetrate a mature and conservative market already dominated by a small number of incumbents. New ‘semantic’ solutions often look very similar to current offerings on the surface whilst necessarily being sufficiently different below the surface to cause disquiet amongst conservative buyers used to something else.

SAP BusinessObjects’ Chief Architect, Yannick Cras, is quick to recognise the “potential for adding metadata to the Deep Web,” but points to the fact that setting up and adopting semantic technologies in the enterprise still represents a very significant investment today, due to their expressiveness and complexity. Mainstream adoption of semantic web technologies will require the community to significantly reduce their total cost of ownership (TCO). Cras points to BusinessObjects’ efforts to simplify analysis of enterprise data, describing typical customers as more comfortable with Microsoft Excel than more complex analytical packages. He recognizes that the Semantic Web’s Web Ontology Language (OWL) is more expressive than BusinessObjects, but points to a lack of basic understanding for business rules in OWL that would make it far too easy for users to ask, and answer, questions that made little sense. The richness and power of the Semantic Web stack is, in many ways, the greatest impediment to easy adoption by mainstream business users.

Despite these concerns, Cras acknowledges that the company continues to closely monitor developments within the Semantic Web community and he speaks enthusiastically of the opportunity for further experiment under the aegis of a new BusinessObjects-funded Chair in Business Intelligence at the École Centrale in Paris.

Oracle’s Jeff Pollock also raised the notion of Business Intelligence and the Excel-user, identifying a class of ‘lightweight analytics’ quite separate to the richer capabilities of BusinessObjects or Siebel. Boston-based Cambridge Semantics is one company to recognize this particular opportunity with their Anzo for Excel solution, which strengthens the link between the comfortable and convenient analytical capabilities of Microsoft Excel and the core data repositories elsewhere in the enterprise.

Ensuring a Logical Segregation of Duties

Moving beyond Business Intelligence, Oracle’s Pollock points to LogicalApps, acquired by the company in late 2007, and notes that it “uses [Semantic Web specification] RDF far under the hood, with little fanfare.”

LogicalApps delivers compliance solutions to monitor and enforce enterprise financial controls and the ‘segregation of duties’ so necessary to avoiding abuses of position. Wikipedia quotes R.A. Botha and J.H.P. Eloff in stating that segregation of duty “as a security principle, has as its primary objective the prevention of fraud and errors. This objective is achieved by disseminating the tasks and associated privileges for a specific business process among multiple users. This principle is demonstrated in the traditional example of separation of duty found in the requirement of two signatures on a cheque.”

Use of semantic technologies does not enable LogicalApps to solve the previously insoluble, and nor does it create a new business category. Rather, Pollock argues, constructing a semantic model of the interrelationships between the plethora of overlapping and potentially conflicting systems permeating most enterprise decision making processes results in a more flexible data model than would be feasible with traditional approaches. This enables LogicalApps’ customers to adapt more rapidly and effectively to organizational and regulatory change, saving time, money and potentially disruptive interference from the financial authorities.
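To illustrate why a graph of statements can be more flexible than a fixed schema, here is a minimal sketch of a segregation-of-duties check over subject-predicate-object triples in the style of RDF. It is not LogicalApps’ implementation, which the article does not detail; the users, roles, duties and predicate names (`hasRole`, `canPerform`, `conflictsWith`) are all hypothetical.

```python
# Hypothetical (subject, predicate, object) statements describing who holds
# which role, what each role may do, and which duties must be kept apart.
TRIPLES = [
    ("alice", "hasRole", "accounts_payable_clerk"),
    ("alice", "hasRole", "payment_approver"),
    ("bob", "hasRole", "accounts_payable_clerk"),
    ("accounts_payable_clerk", "canPerform", "create_payment"),
    ("payment_approver", "canPerform", "approve_payment"),
    ("create_payment", "conflictsWith", "approve_payment"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def duties_of(triples, user):
    """Every duty a user can perform through any of their roles."""
    duties = set()
    for role in objects(triples, user, "hasRole"):
        duties |= objects(triples, role, "canPerform")
    return duties


def violations(triples):
    """List (user, duty_a, duty_b) where one user holds two conflicting duties."""
    conflicts = [(s, o) for s, p, o in triples if p == "conflictsWith"]
    users = sorted({s for s, p, _ in triples if p == "hasRole"})
    found = []
    for user in users:
        duties = duties_of(triples, user)
        for a, b in conflicts:
            if a in duties and b in duties:
                found.append((user, a, b))
    return found


if __name__ == "__main__":
    for user, a, b in violations(TRIPLES):
        print(f"{user} can both {a} and {b}")
```

A new regulation or a reorganization becomes a matter of adding or removing statements; the checking code and the shape of the data store stay the same, which is the kind of adaptability Pollock describes.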

Oracle’s core database product, too, embraces the Semantic Web and although Pollock is quick to say “few would ever think of Oracle as supplying ‘semantic web’ solutions,” he does note that “RDF support is already a major feature of the Enterprise Edition database and will be coursing through the veins of many other Oracle products in the future.”

Semantic Web Inside

Whilst there will be cases in which a ‘semantic’ solution disrupts an existing market, or creates a new one, it seems far more likely that semantic technologies and techniques will be of greatest utility adding to the functionality of existing products. Oracle see value in bringing RDF to their flagship database, but it still does everything it did before, and the value proposition expressed in corporate marketing and sales collateral remains unlikely to give more prominence to the product’s RDF capabilities than to SQL support, scalability, performance, or any of the other criteria upon which an enterprise might select Oracle in preference to its competitors. It is interesting, however, that Microsoft [MSFT] does give some prominence to semantic technology in collateral for their Interactive Media Manager:

“IMM introduces a powerful, XML-based, Semantic metadata model that uses the Resource Description Framework (RDF) and Web Ontology Language (OWL) specifications from the World Wide Web Consortium. Whereas traditional metadata merely lists characteristics, this RDF model makes descriptive statements about resources. This allows companies to add nuance and intelligence to media management beyond what is possible with traditional metadata.

With the IMM Semantic Metadata Store, computers can automatically understand complex relationships between media assets and categories based on those assets’ metadata properties. Potential benefits of this approach include: improved search relevance, enhanced workflow tracking, and automatic transfer of metadata properties to new assets during transcoding. Most importantly, the IMM RDF model overcomes traditional barriers to metadata sharing between external systems.”
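The distinction the collateral draws between listing characteristics and making statements about resources, and the claim about carrying metadata across to transcoded assets, can be sketched in a few lines. This is an illustration of the general idea only and says nothing about how IMM works internally; the asset names, property names and the `derive_asset` and `match` helpers are hypothetical.

```python
# Hypothetical statements about a media asset, as (subject, predicate, object).
STATEMENTS = [
    ("clip42.mov", "depicts", "Eiffel Tower"),
    ("clip42.mov", "rightsHolder", "Example Studios"),
    ("clip42.mov", "format", "QuickTime"),
]


def derive_asset(statements, source, new_asset,
                 inherited=("depicts", "rightsHolder")):
    """Return the statements plus those for a transcoded copy of `source`.

    Only properties named in `inherited` are copied; technical properties
    such as format are assumed to differ for the new asset.
    """
    result = list(statements)
    result.append((new_asset, "derivedFrom", source))
    for s, p, o in statements:
        if s == source and p in inherited:
            result.append((new_asset, p, o))
    return result


def match(statements, s=None, p=None, o=None):
    """Return statements matching a pattern; None acts as a wildcard."""
    return [
        t for t in statements
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    store = derive_asset(STATEMENTS, "clip42.mov", "clip42.mp4")
    for subject, _, _ in match(store, p="depicts", o="Eiffel Tower"):
        print(subject)
```

Because every fact is a statement linking one thing to another, a search for everything that depicts the Eiffel Tower finds the transcoded copy as well as the original without any extra bookkeeping.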

Semantic technologies are far more prevalent within the enterprise than many assume, but it’s a quiet revolution and one that is typically more likely to be carried on the back of the product update cycle of incumbent suppliers than by the entry of a high profile semantically powered challenger. Does this mean that the road to success for today’s semantic technology companies is one that necessarily involves acquisition by SAP, Oracle and their peers? Perhaps only time will tell.


Paul Miller

Paul Miller is the founder of The Cloud of Data. Paul offers analysis and consultancy at the interface between the enterprise and the web, specifically focusing on Semantic Technologies and Cloud Computing. In addition to bespoke consultancy services, Paul writes on the Semantic Web for ZDNet and on Cloud Computing for CloudAve. He regularly conducts podcast interviews with executives from companies leading in the exploitation of the opportunities offered by these new technology trends, and chairs the monthly Semantic Web Gang podcast.