
Opinion: Nova Spivack on a New Era in Semantic Web History

December 23, 2014

[Editor’s Note: Thanks to Nova Spivack for this guest post. Nova is a frequent speaker and blogger as well as CEO of Bottlenose, which uses big data mining to discover emerging trends for large brands and enterprises. More about Nova can be found on his website.]
2014 was the end of an era in Semantic Web history, and the beginning of a new one. The era that ended was the first wave of the Semantic Web. The era that began was the Era of Cognitive Computing.

As for the end of the first wave of the Semantic Web: Google announced it would end the Freebase project, contributing its data to Wikidata. Freebase, while not based on W3C Semantic Web standards, was certainly one of the most significant semantic open data initiatives, and it helped lead to Google's Knowledge Graph. It is interesting that Google has decided to end the project. But even more interesting is why: Google has evolved its Knowledge Graph beyond the need for Freebase, on two fronts.

Firstly, as Google Fellow Ramanathan Guha explained this year at the first annual Cognitive Computing Forum, Google's Schema.org Web metadata framework has become a major data source for Google's Knowledge Graph. Schema.org is now used across more than 5 million Internet domains and continues to grow. Guha also stated that a growing majority of the top sites today provide Schema.org metadata. That breadth of adoption makes Schema.org arguably the most successful semantic data project ever. What is notable about Schema.org is that it is a decentralized metadata scheme, very much in line with some of the early predictions in Paul Ford's epic 2002 article about how Google could beat Amazon and eBay to the Semantic Web.
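To make the idea concrete, here is a minimal sketch of what a publisher's Schema.org markup might contain. JSON-LD is one of the syntaxes Schema.org supports for embedding metadata in a page; the property names below come from the public Schema.org vocabulary, while the specific values are invented for illustration:

```python
import json

# A minimal, illustrative Schema.org item describing an article.
# "@context" and "@type" bind the plain keys below to the
# Schema.org vocabulary so crawlers can interpret them.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "A New Era in Semantic Web History",
    "datePublished": "2014-12-23",
    "author": {"@type": "Person", "name": "Nova Spivack"},
}

# Serialized, this would sit inside a <script type="application/ld+json">
# element in the page's HTML, where search crawlers pick it up.
jsonld = json.dumps(article, indent=2)
print(jsonld)
```

Because each site publishes its own markup in place, no central database has to be maintained; crawlers simply aggregate what the Web already declares about itself.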

Secondly, Google's ongoing research into the Knowledge Vault provides a more automated method of feeding the Knowledge Graph, based on artificial intelligence. This initiative applies cognitive computing, as well as data mining techniques, to extract knowledge from raw content on the Web, beyond what is explicitly contributed as metadata around that content.

These two initiatives, combined, will ultimately reduce the need for a manually curated knowledge base like Freebase and enable a more rapidly updated one. Whereas Freebase's data had to be manually contributed and manually curated, in the new approach these processes can be done both manually and automatically. I expect the automated approach increasingly to outrun the manual one as cognitive computing gets more accurate. The challenge today is that machine understanding of language, while good, is not 100% accurate, so automatically extracted knowledge will contain somewhere between 10% and 60% error. These error rates can be reduced to a large extent by using competing approaches in tandem to mine for knowledge and applying statistical methods to detect and eliminate errors in the graph data set; improvements in natural language understanding will also help.
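One simple statistical filter of the kind described is agreement voting across independent extractors: a candidate fact is kept only when enough extractors independently propose it, which suppresses each extractor's idiosyncratic errors. A toy sketch, with invented extractor outputs:

```python
from collections import Counter
from itertools import chain

# Hypothetical outputs of three independent knowledge extractors,
# each a set of (subject, predicate, object) facts mined from text.
# One extractor has produced a spurious fact.
extractor_a = {("Paris", "capitalOf", "France"), ("Paris", "capitalOf", "Texas")}
extractor_b = {("Paris", "capitalOf", "France"), ("Rome", "capitalOf", "Italy")}
extractor_c = {("Paris", "capitalOf", "France"), ("Rome", "capitalOf", "Italy")}

def vote(extractions, min_votes=2):
    """Keep only facts proposed by at least `min_votes` extractors."""
    counts = Counter(chain.from_iterable(extractions))
    return {fact for fact, n in counts.items() if n >= min_votes}

accepted = vote([extractor_a, extractor_b, extractor_c])
# ("Paris", "capitalOf", "Texas"), proposed by only one extractor,
# fails the vote and is dropped from the accepted set.
print(sorted(accepted))
```

Real systems weight votes by each extractor's estimated reliability rather than counting them equally, but the principle is the same: disagreement between independent methods is a usable error signal.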

Is the Semantic Web over? No. It lives on in many Linked Open Data projects and, more extensively, in the Schema.org community. In addition, a galaxy of commercial ventures apply semantic, but not necessarily Semantic Web, principles to build knowledge graphs in their products; the core ideas live on. However, adoption of RDF, OWL and SPARQL, the original W3C standards of the Semantic Web, has stalled. The Semantic Web is happening, but it probably won't be as open as many had hoped.
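For readers unfamiliar with that W3C stack: it rests on a single data model, the subject-predicate-object triple, which SPARQL queries by pattern matching over a graph. A dependency-free toy sketch of that idea (illustrative data, and not real SPARQL syntax):

```python
# A tiny in-memory triple store illustrating the RDF data model:
# every fact is a (subject, predicate, object) triple.
triples = [
    ("Freebase", "contributedTo", "Wikidata"),
    ("Schema.org", "feeds", "KnowledgeGraph"),
    ("KnowledgeVault", "feeds", "KnowledgeGraph"),
]

def match(pattern):
    """Return triples matching a pattern; None is a wildcard,
    much like a ?variable in a SPARQL basic graph pattern."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What feeds the Knowledge Graph?" -- roughly analogous to
# SELECT ?s WHERE { ?s :feeds :KnowledgeGraph }
sources = [s for s, _, _ in match((None, "feeds", "KnowledgeGraph"))]
print(sources)
```

RDF adds global identifiers (URIs) and OWL adds inference on top of this model, but triples plus pattern matching is the essential core.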

Now for the birth of the new era: Cognitive Computing. 2014 was the year Cognitive Computing became a buzzword, but the field, formerly known as artificial intelligence, has existed for many decades. The writing is on the wall that Cognitive Computing is the new frontier after the Semantic Web. The Semantic Web was a shift in focus to an approach we could call “Thin AI”: the idea that much of the knowledge and intelligence needed by applications could exist outside them, in machine-understandable form. Cognitive Computing represents a swing of the pendulum to the opposite extreme, “Thick AI,” where much of the intelligence is hard-coded, or trained, into the applications, regardless of where the knowledge exists.

Google is making progress on both fronts. Schema.org is successfully generating lots of external knowledge that can be extracted — this is Thin AI. But Google has also been hard at work on Thick AI — particularly Deep Learning — in competition with similar projects at all the major search incumbents.

Deep Learning, while impressive, is just more accurate automated machine classification. It is not knowledge modeling per se; rather, it is an approach to training systems to recognize and classify patterns. At some point in the future, when Deep Learning matures and the cost of computing is far cheaper than it is today, it might make sense to apply Deep Learning to build classifiers that recognize all of the core concepts that make up human consensus reality. But discovering and classifying how these concepts relate will still be difficult, unless systems that can learn about relationships with the subtlety of humans become possible.

Is it possible to apply Deep Learning to relationship detection and classification? Probably yes, but this will likely be a second phase, after Deep Learning is first broadly applied to entity classification. Ultimately I don't see any technical reason why a combination of the Knowledge Graph, the Knowledge Vault, and new Deep Learning capabilities couldn't be applied to automatically generating and curating the world's knowledge graph to a level of richness resembling the original vision of the Semantic Web. But this will probably take two or three decades.

Yet regardless of what approach is used, when understanding the relationships between entities is important, there is still no substitute today for hand-curated knowledge modeling. When relationships are explicitly mentioned and described somewhere, they are easy to extract. But most relationships are implicit: they are not explicitly stated and must instead be inferred or logically induced. Automatically extracting and inferring the complex formal logical relationships among entities from unstructured data sources such as Web content, news articles, or medical research literature is difficult today. In other words, for complex domains and sophisticated knowledge modeling there is still no substitute for hand-made ontologies; automated machine approaches cannot yet compete with human knowledge modelers at this task.

Google is moving away from hand-made ontologies; it was never a fan of them. From the early days, Google's philosophy has been biased towards big data over manually constructed knowledge. The end of Freebase and the rise of the Knowledge Vault are just examples of this bias. However, Schema.org's impressive growth and adoption can't be ignored either, and the jury is still out as to whether decentralized ecosystems can ultimately out-scale more centralized data-mining approaches like the Knowledge Vault to reach Semantic Web dominance. Although Freebase is being handed off, it is not necessarily over: it is going into the Wikidata project, which could become an increasingly important repository of open knowledge. The war for the Semantic Web is not over.

  • Tony Sarris

    Thanks, Nova, for an insightful guest post about where
    things are headed with the semantic web and its relationship to cognitive
    computing. I agree that statistical or inductive approaches to semantics based
    on analyzing big data seem to have in many cases superseded the constructive
    approaches of the semantic web (including both crowd-sourced Linked Open Data
    models and the use of standardized metadata tagging by projects such as
    schema.org). But I think the best approaches are often hybridized or blended
    approaches involving both manual construction and inductive methods such as
    machine learning. I think that will be even more the case as attempts are made
    to capture and encode the logic of common everyday processes as part of
    automating, or at least augmenting, them through cognitive computing.

    I like your characterization of thin versus thick AI as well. I think another
    similar axis is fuzzy versus precise. Many of the approaches, whether
    constructive or inductive, seem to be after the Holy Grail of precision. But in
    my opinion most everyday cognitive computing functions simply don’t demand that
    level of precision to be useful. The goal needn’t be to fully automate tasks,
    but rather to augment humans performing a task. That’s really the notion of
    intelligent virtual assistants.

    Whether cognitive computing-based intelligent assistants work behind the scenes
    to proactively augment a human-based process or iteratively interact with a
    human to perform a process, there is a much higher tolerance threshold when a
    human is in the loop. And I think humans very much still want to be in the loop
    — aided by automation, not replaced or displaced by it.

    For that reason, I’m a big believer in generative approaches that can
    dynamically and inexpensively support a large set of use cases, including
    long-tail cases, by focusing primarily on generating candidate sets and
    iterating through learning loops, rather than trying to get everything right
    the first time. At the first DATAVERSITY Cognitive Computing Forum, I recall many
    comments that 80% is considered a high accuracy rate and 80% or even less is
    good enough for many types of tasks. Self-driving cars and automated guided
    brain surgery are examples of tasks where higher thresholds are necessary,
    but the point is that there are many more cases where high levels of
    precision just aren't needed. For a discussion of the constructive, inductive
    and generative approaches, please see my blog:
    http://tonysarris.wordpress.com/2013/07/03/how-to-get-semantified/

    Tony
