In 2005, I started learning about the so-called Semantic Web. It wasn’t until 2008, the same year I started my PhD, that I finally understood what the Semantic Web was really about. At the time, I made a $1,000 bet with three college buddies that the Semantic Web would be mainstream by the time I finished my PhD. I know I’m going to win! In this post, I will argue why.
The Web vs The Semantic Web
When somebody asks me “what is the Semantic Web?”, I immediately ask back: “What is the Web?”. Stop for a minute and think about it. You use the Web every single day of your life, but can you explain what it actually is? And by the way, the Web is not the Internet! It is a layer on top of the Internet.
The Web can be seen as an information space made up of two things: webpages and links between them. That’s it! More often than not, the content of a webpage comes from an underlying database that stores data in a structured way. The webpages themselves, however, are unstructured. Search engines crawl the webpages and have to figure out what they are all about. And when you use a search engine, your input is a keyword and the result is a list of links to webpages. Imagine you want to find all soccer players who played as goalkeeper for a club whose stadium has more than 40,000 seats and who were born in a country with more than 10 million inhabitants. Good luck finding an answer to that on any search engine.
In summary: information is originally stored as structured data; it is published in an unstructured format (HTML webpages); search engines crawl the unstructured webpages to recover the structure, and then let users pose only unstructured queries. Don’t you think something is wrong here? If the information was originally stored as structured data, then it should be published as structured data on the Web, and search engines should enable users to pose structured queries. This is what the Semantic Web is all about: structured data on the Web. Several standards have been created to accomplish this objective, such as RDF, RDFa and SPARQL. The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web.
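To give a flavor of what “structured data on the Web” looks like, here are a few RDF triples written in the Turtle syntax. Each statement is a subject–predicate–object triple, and because the identifiers are URIs, data published by different sites links together into one graph. (The resource and property names below are illustrative, loosely modeled on DBpedia, and are not guaranteed to match the live dataset.)

```turtle
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

# Three triples about one soccer player: type, team, birthplace.
dbr:Iker_Casillas
    a dbo:SoccerPlayer ;
    dbo:team dbr:Real_Madrid ;
    dbo:birthPlace dbr:Spain .
```

Because `dbr:Spain` is a URI rather than a plain string, anyone else’s statements about Spain (its population, say) connect to this data automatically.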
My position is that the Semantic Web has gone mainstream. Before I make my argument, let me first define what I mean by mainstream. The Merriam-Webster dictionary defines mainstream as “a prevailing current or direction of activity or influence.” For example, the Web became mainstream in the early 90s when the prevailing current became publishing your information as a webpage. Therefore, for the Semantic Web to have gone mainstream, the prevailing current must be publishing structured data on the Web. The following are five areas that support my claim.
1) HTML5

HTML5 is a major update to the HTML language with a host of new features. HTML5 is a significant step forward for the Semantic Web in that it reaffirms the separation of text and multimedia content, data, and presentation. It introduces a number of new tags such as <article>, <section>, <header>, <footer> and <video> that help separate the text and multimedia content of a page from its presentation template. This makes it easier to apply information extraction techniques to webpages, and thus to extract data from webpages that are otherwise not prepared for the Semantic Web. In addition, HTML5 introduces a new, simplified metadata language, microdata, to annotate the data inside webpages with the meaning of the data items. Although microdata is not directly compatible with the Semantic Web stack, the W3C has published guidelines for converting microdata into RDF data. A less technical document intended for web publishers provides a detailed comparison of microdata, RDFa, and microformats, and explains how they can meaningfully be combined in a single webpage if needed.
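As a rough sketch, here is how those structural tags separate content from presentation in a blog post (the titles and dates are placeholders):

```html
<article>
  <header>
    <h1>The Semantic Web Goes Mainstream</h1>
    <p>Posted on <time datetime="2012-03-12">March 12, 2012</time></p>
  </header>
  <section>
    <p>The body of the post goes here...</p>
  </section>
  <footer>
    <p>Filed under: Semantic Web</p>
  </footer>
</article>
```

An extractor no longer has to guess which `<div>` holds the article and which holds the sidebar; the tags say so.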
2) Schema.org

Search engines are the largest consumers of Web content and are therefore naturally interested in Semantic Web technologies that can help them make sense of information on the Web and offer users a better search experience. Google, Bing and Yahoo! have teamed up to create a schema that describes what webpages are about: Schema.org. Just as the sitemap protocol allows a search engine to understand the structure of your website, schema.org allows you to express the meaning of your content. Is your website about music events? How about markup for your bar’s website? Schema.org offers a large list of things that you can describe. The examples in schema.org are given in HTML5’s microdata, but RDFa is also accepted by the major search engines. The use of this metadata is currently limited to displaying enhanced results in web search, which was pioneered by Yahoo! as part of their SearchMonkey program, and later adopted by Google under the name Rich Snippets and by Bing as Bing Tiles.  Google also uses data extracted from the Web to power some of their vertical search products, including Recipe Search and Video Search.
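For example, a music event could be annotated with microdata along these lines (a sketch using the schema.org MusicEvent and MusicVenue types; the event itself is made up):

```html
<div itemscope itemtype="http://schema.org/MusicEvent">
  <span itemprop="name">Jazz Night</span> at
  <span itemprop="location" itemscope itemtype="http://schema.org/MusicVenue">
    <span itemprop="name">The Example Club</span>
  </span>
  on <time itemprop="startDate" datetime="2012-06-05">June 5</time>.
</div>
```

To a human the page looks unchanged; to a crawler, the `itemscope`, `itemtype` and `itemprop` attributes say exactly what kind of thing is being described and which text is its name, venue, and date.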
3) Open Graph Protocol
Understanding web content is not only important for improving the search experience; it also allows companies to gain insight into the interests of users based on the kind of content they interact with. In 2010, Facebook launched the Open Graph Protocol, another set of schemas intended for webmasters to describe the semantics of their content. In Facebook’s case, the metadata is added only to the header of webpages, and only RDFa is accepted. Although the information that is captured is more limited than in the case of schema.org, even these small bits of information allow Facebook to learn valuable insights about their users’ behavior outside of their own pages. In particular, whenever a user Likes a page, Facebook extracts this metadata to learn what the page is about. The structured data helps Facebook render the information and enrich the user’s profile with their interests. If you have ever Liked something, or added a Like button, you are participating in the Semantic Web! The Open Graph Protocol provides its own schema, distinct from and competing with Schema.org.
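The markup itself is lightweight: a handful of meta tags in the page header, using RDFa-style property attributes. A movie page might carry something like the following (the URLs and values here are hypothetical):

```html
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="video.movie" />
  <meta property="og:url" content="http://www.example.com/movies/the-rock" />
  <meta property="og:image" content="http://www.example.com/images/the-rock.jpg" />
</head>
```

When a user Likes this page, Facebook knows they liked a specific movie called “The Rock”, not just an arbitrary URL.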
4) Big Data vs Queryable Data
Big Data is a hot topic, but Dave McClure stated it nicely: “kinda getting tired of Big Data meme. sure, data is SHINY…but doing something useful with it is less obvious & requires thought.” It is one thing to have a marketplace of raw data (CSV and XML dumps, etc.); it is another to actually query the data and do something useful with it. APIs over datasets amount to a pre-canned set of queries, which limits what you can really ask. Open Data, especially government data, has taken the lead in offering query interfaces directly on the data. In other words, you can write a query and get the data back. The query is written in a standardized language called SPARQL. The US’s data.gov and the UK’s data.gov.uk both offer SPARQL services, for example http://data.gov.uk/sparql and http://health.data.gov/sparql . Startups like Kasabi are leading the effort to create a queryable data marketplace. But that’s not all. For the past several years, there has been a community effort, called DBpedia, to “structurize” Wikipedia in order to allow users to write SPARQL queries directly over Wikipedia data. Recall my initial query: find all soccer players who played as goalkeeper for a club whose stadium has more than 40,000 seats and who were born in a country with more than 10 million inhabitants. Problem solved! DBpedia has been a community effort, but the Wikimedia Foundation has decided to get on board. This year, they announced the Wikidata project, which “aims to create a free knowledge base about the world that can be read and edited by humans and machines alike”.
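To make this concrete, the goalkeeper query can be sketched in SPARQL against the DBpedia endpoint. This is a sketch: the property names (`dbo:position`, `dbo:ground`, `dbo:seatingCapacity`, `dbo:populationTotal`) follow my reading of the DBpedia ontology and may differ across DBpedia releases, but the shape of the query is the point.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?player ?club ?country
WHERE {
  ?player a dbo:SoccerPlayer ;
          dbo:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
          dbo:team ?club ;
          dbo:birthPlace ?country .
  ?club dbo:ground ?stadium .
  ?stadium dbo:seatingCapacity ?capacity .
  ?country dbo:populationTotal ?population .
  FILTER (?capacity > 40000 && ?population > 10000000)
}
```

One declarative query joins players, clubs, stadiums and countries; no search engine keyword box can express this.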
5) Semantic Enterprise
Another measure of mainstream is the influence and adoption of Semantic Web technology inside corporations and the government. Although most enterprise applications using Semantic Web technologies are hidden from plain view, we can glean insight from projects that are public and from the increasing number of tool and service providers that are supplying corporations and the government. As previously mentioned, Yahoo! pioneered adding semantic markup to webpages; Google and Bing followed, and now they have teamed up to create Schema.org. Believe it or not, Apple’s Siri is an outcome of government-sponsored Semantic Web research, and IBM’s Watson is also powered by Semantic Web technology.

Large media players are using Semantic Web technologies. The NY Times has been publishing their data as Linked Data for the past three years and successfully pushed the creation of a standard for semantic markup for news data: rNews. The Associated Press is offering a standardized AP News Taxonomy, all based on Semantic Web technologies. The Tribune Company also has a semantic program. Thomson Reuters has a service, Open Calais, that takes text and outputs Semantic Web data. The BBC has also been publishing their data as Linked Data for several years, and their 2010 World Cup website was powered exclusively by Semantic Web technologies. Yahoo! is building a Web of Objects that comprises information regarding all entities known to Yahoo! in a variety of domains (movies, music, sports, etc.). The schema for this knowledge base is an OWL ontology.

BestBuy can be considered the first player to demonstrate the SEO benefit of adding semantic markup; Overstock came afterwards. Volkswagen has created a contextual search engine based on Semantic Web technology. Amdocs is using semantic databases to harness intelligence for customer service applications. Experian recently bought Garlik, a company that used Semantic Web technologies to help consumers protect themselves from identity theft and financial fraud.
Cray recently released uRiKA, a supercomputer specialized for graph analysis on big data. It comes with Semantic Web software out of the box. National libraries, including the Library of Congress and the Spanish and German national libraries, are moving from the 40-year-old MARC data format to Semantic Web standards. Semantic Web technologies are involved at NASA and in several industries such as finance, medicine, pharma, publishing… even firefighters in Amsterdam use Semantic Web technologies to save lives. And I haven’t even started talking about tools. Semantic Web technologies are supported in the Drupal 7 core. Several e-commerce CMSs support Semantic Web technologies for SEO. Oracle has a semantic database and a large set of competitors: 4store, AllegroGraph, Bigdata, Dydra, Meronymy, OWLIM, Stardog, Virtuoso. What about startups? Check out just a few: Attune, Capsenta, GetGlue, Hakia, Kasabi, Seevl, SocialWire, Tumbup, TrueKnowledge, Zemanta.
As the Semantic Web becomes ubiquitous, new opportunities will surface for disrupting existing marketplaces using open technology and for increasing the level of intelligence in user-facing products.
For example, while both Apple’s Siri and TrueKnowledge’s Evi do a reasonable job at question answering, they are essentially closed systems where the providers decide which data sources and services are included in the responses. One could easily imagine a version of the digital assistant built on open data and services, where responses come from whichever sources are expected to provide the most relevant and accurate answers.
We also foresee major advances in personalization. At the current state of the art, Yahoo!’s CORE algorithm, based on machine learning, cycles through 45,000 variations of Yahoo!’s homepage every five minutes to optimize the overall click-through rate by considering user demographics and broad topical interests. More sophisticated algorithms will be based on detailed user models that capture the user’s current and past relationships to every other entity in the world. From what is publicly known, such entity graphs are currently being built at Yahoo! as well as at Google. The graph is of course also the natural data model for Facebook, where the power of the representation model has recently been significantly increased by moving from simple Likes (“John likes Pizza”) to ‘verbs’ that describe richer relationships between users and objects (e.g. “John ate Pizza”).
In conclusion, the Semantic Web has gone mainstream and is impacting web developers, publishers, consumers and the enterprise. I should (hopefully) be finishing my PhD next year, and I think I’m going to win my bet. What do you think?
If you’d like to learn more about all of this, I hope you’ll join me at the SemTechBiz Conference in San Francisco this June. (Early Registration discount pricing ends Thursday, March 15!)
NOTE: This post was co-authored with Peter Mika (Yahoo!) for a presentation at the 2012 SXSW Conference.