[Editor’s Note: Thanks to Nova Spivack for this guest post. Nova is a frequent speaker and blogger as well as CEO of Bottlenose, which uses big data mining to discover emerging trends for large brands and enterprises. More about Nova can be found on his website.]
2014 was the end of an era in Semantic Web history, and the beginning of a new one. The era that ended was the first wave of the Semantic Web. The era that began was the Era of Cognitive Computing.
As for the end of the first wave of the Semantic Web, Google announced it would end the Freebase project, contributing the data to Wikidata. Freebase, while not based on W3C Semantic Web standards, was certainly one of the most significant semantic open data initiatives and helped lead to Google’s Knowledge Graph. It’s interesting that Google has decided to end the project. But even more interesting is why. Google has evolved its Knowledge Graph beyond the need for Freebase, on two fronts.
Firstly, as Google Fellow Ramanathan Guha explained this year at the first annual Cognitive Computing Forum, Google’s Schema.org Web metadata framework has become a major data source for Google’s Knowledge Graph. Schema.org is now used on more than 5 million Internet domains, and continues to see growth. Guha also stated that a growing majority of the top sites today are providing Schema.org metadata. The broad level of adoption of Schema.org makes it arguably the most successful semantic data project ever. What is notable about Schema.org is that it is a decentralized metadata scheme, very much in line with some of the early predictions in Paul Ford’s epic 2002 article about how Google could beat Amazon and eBay to the Semantic Web.
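To make the idea of decentralized metadata concrete, here is a minimal sketch of what a site owner might publish. It assumes the JSON-LD syntax, which is one of the formats Schema.org supports; the choice of format and the variable names are illustrative and do not come from Guha’s talk. `Person`, `Organization`, `name`, `jobTitle` and `worksFor` are real Schema.org terms.

```python
import json

# A Schema.org description of a person, expressed as JSON-LD.
# The facts are taken from the editor's note at the top of this post.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Nova Spivack",
    "jobTitle": "CEO",
    "worksFor": {"@type": "Organization", "name": "Bottlenose"},
}

# Serialize the description and wrap it in the script tag that a
# page author would embed in the page's HTML.
json_ld = json.dumps(person, indent=2)
script_tag = '<script type="application/ld+json">\n' + json_ld + "\n</script>"

print(script_tag)
```

Each site publishes descriptions like this about its own content, and a crawler can collect them without any central curator, which is what makes the scheme decentralized.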
Secondly, Google’s ongoing research into the Google Knowledge Vault provides a more automated method of feeding the Google Knowledge Graph, based on artificial intelligence. This initiative applies cognitive computing, as well as data mining techniques, to extract knowledge from raw content on the Web, beyond what is explicitly contributed as metadata around that content.
These two initiatives combined will ultimately reduce the need for a manually curated knowledge base like Freebase, and enable a more rapidly updated database. Whereas in Freebase the data has to be manually contributed and manually curated, in the new approach these processes can be done both manually and automatically. Increasingly I expect the automated approach to outrun the manual approach, as cognitive computing gets more accurate. The challenge today is that machine understanding of language, while good, is not 100% accurate, and therefore all automatically extracted knowledge will contain somewhere between 10% and 60% error. These error rates can be reduced to a large extent by using competing approaches in tandem to mine for knowledge, and by using statistical methods to detect and eliminate errors in the graph data set. Improvements in natural language understanding will also help.
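One simple way to picture "competing approaches in tandem" is a voting scheme: a fact is accepted only when several independent extractors agree on it. The sketch below is an illustration under that assumption, not a description of how Knowledge Vault works. The extractor outputs, the predicate names and the `consensus_triples` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical outputs of three independent extractors, each a set of
# (subject, predicate, object) triples mined from the same content.
extractor_a = {
    ("Freebase", "data_contributed_to", "Wikidata"),
    ("Schema.org", "feeds", "Knowledge Graph"),
}
extractor_b = {
    ("Freebase", "data_contributed_to", "Wikidata"),
    ("Freebase", "based_on", "W3C standards"),  # an extraction error
}
extractor_c = {
    ("Freebase", "data_contributed_to", "Wikidata"),
    ("Schema.org", "feeds", "Knowledge Graph"),
}


def consensus_triples(extractor_outputs, min_votes=2):
    """Keep only the triples reported by at least `min_votes` extractors."""
    votes = Counter()
    for triples in extractor_outputs:
        # set() guards against one extractor voting twice for a triple.
        votes.update(set(triples))
    return {triple for triple, count in votes.items() if count >= min_votes}


accepted = consensus_triples([extractor_a, extractor_b, extractor_c])
```

Here the erroneous triple is dropped because only one extractor produced it. Voting helps only to the extent that the extractors make different mistakes; if they share the same blind spots, the same errors survive.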
Is the Semantic Web over? No, it lives on in many Linked Open Data projects, and more extensively in the Schema.org community. In addition, a galaxy of commercial ventures apply semantic, but not necessarily Semantic Web, principles to build knowledge graphs in their products, so the core ideas live on. However, adoption of RDF, OWL and SPARQL, the original W3C standards of the Semantic Web, has stalled. The Semantic Web is happening, but it probably won’t be as open as many had hoped.
Now for the birth of the new era: Cognitive Computing. 2014 was the year that Cognitive Computing became a buzzword, but in actuality the field, formerly known as artificial intelligence, has existed for many decades. The writing is on the wall that Cognitive Computing is the new frontier after the Semantic Web. The Semantic Web was a shift in focus to an approach we could call “Thin AI”: the idea that much of the knowledge and intelligence needed by applications could exist outside them, in machine-understandable form. Cognitive Computing represents a swing of the pendulum to the opposite extreme, “Thick AI”, where much of the intelligence is hard-coded, or trained, into the applications, regardless of where the knowledge exists.
Google is making progress on both fronts. Schema.org is successfully generating lots of external knowledge that can be extracted — this is Thin AI. But Google has also been hard at work on Thick AI — particularly Deep Learning — in competition with similar projects at all the major search incumbents.
Deep Learning, while impressive, is just more accurate automated machine classification. It is not knowledge modeling per se, but rather an approach to training systems to recognize and classify patterns. At some point in the future, when Deep Learning has matured and the cost of computing is far lower than it is today, it might make sense to apply Deep Learning to build classifiers that recognize all of the core concepts that make up human consensus reality. But discovering and classifying how these concepts relate will still be difficult, unless systems that can learn about relationships with the subtlety of humans become possible.
Is it possible to apply Deep Learning to relationship detection and classification? Probably yes, but this will likely be a second phase after Deep Learning is first broadly applied to entity classification. Ultimately I don’t see any technical reason why a combination of the Knowledge Graph, Knowledge Vault, and new Deep Learning capabilities couldn’t be applied to automatically generating and curating the world’s knowledge graph to a level of richness that will resemble the original vision of the Semantic Web. But this will probably take two or three decades.
Yet regardless of what approach is used, when understanding the relationships between entities is important, there is still no substitute for hand-curated knowledge modeling today. When relationships are explicitly mentioned and described somewhere, they are easy to extract. But most relationships are implicit: they are not explicitly stated but must be inferred or logically induced. Automatically extracting and inferring the complex formal logical relationships among entities from unstructured data sources such as Web content, news articles, or medical research literature is difficult today. In other words, for complex domains or sophisticated knowledge modeling, there is still no substitute for hand-made ontologies; automated machine approaches cannot yet compete with human knowledge modelers on this task.
Google is moving away from hand-made ontologies; it was never a fan of them. From the early days, Google’s philosophy has been biased towards big data over manually constructed knowledge. The end of Freebase and the rise of Knowledge Vault are just examples of this bias. However, Schema.org’s impressive growth and adoption can’t be ignored either, and the jury is still out as to whether decentralized ecosystems can ultimately out-scale more centralized data-mining approaches like Knowledge Vault to reach Semantic Web dominance. Although Freebase is being handed off, it is not necessarily over. It is going into the Wikidata project, which could be an increasingly important repository of open knowledge in the future. The war for the Semantic Web is not over.