How will webpage data be interpreted in the next few years? The Semantic Web community has high hopes that ever-evolving semantic standards will help systems identify and extract the rich data found on the web, ultimately making it more useful. With the announcement of Schema.org support for GoodRelations in November, it seems clear that semantic progress is being made on the e-commerce front, and at an accelerating rate. Martin Hepp, the creator of GoodRelations, expects the adoption of rich, structured e-commerce data to increase significantly this year.
However, Mike Tung, founder and CEO of the data parsing service DiffBot, has less faith that the standards required for a true Semantic Web will ever be completely and effectively implemented. In an interview with Xconomy, he argues that for semantic standards to work, content owners must mark up their content twice: once for human readers and a second time for the semantic standards. That means extra work, and it also opens the door to content stuffing (SEO spam).
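To make the double-markup point concrete, here is a minimal sketch (the product snippet and every value in it are invented for illustration) of what the duplication looks like: the name and price appear once as visible HTML for readers, and a second time as schema.org microdata attributes, which is what a machine actually consumes.

```python
from html.parser import HTMLParser

# Hypothetical product snippet: the same facts are written twice --
# once as visible text for people, once as itemprop/content attributes
# for machines. Keeping the two in sync is manual work for the publisher.
PAGE = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Acme Running Shoe</h1>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="priceCurrency" content="USD">$</span>
    <span itemprop="price" content="79.00">79.00</span>
  </div>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collects itemprop values, preferring an explicit machine-readable
    'content' attribute over the human-visible element text."""
    def __init__(self):
        super().__init__()
        self._pending = None   # itemprop still waiting for its text value
        self.props = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" not in attrs:
            return
        self._pending = None
        if "content" in attrs:                 # machine-readable value wins
            self.props[attrs["itemprop"]] = attrs["content"]
        else:                                  # fall back to the element text
            self._pending = attrs["itemprop"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.props[self._pending] = data.strip()
            self._pending = None

collector = ItempropCollector()
collector.feed(PAGE)
print(collector.props)
# {'name': 'Acme Running Shoe', 'priceCurrency': 'USD', 'price': '79.00'}
```

Every one of those attributes is something the publisher had to add, and keep in sync with the visible copy, by hand. That duplication is exactly the overhead Tung is pointing at.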
Since Schema.org launched, the search engines behind it (Google, Yahoo!, Bing, and Yandex) have been very successful at getting website owners to add structured data, and at displaying that data in rich search results, like Google’s Rich Snippets. But what Tung says has held true, at least to some degree. SEO specialists have noticed that rich snippets improve click-through rates: a review or article may showcase a picture of the author and perhaps some rating stars, which catches the searcher’s eye first and can lift click-through and even conversion rates. Unfortunately, pages are sometimes filled with falsified or exaggerated structured data simply to stand out in search results.
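That kind of gaming is possible precisely because the machine-readable value can quietly diverge from what the reader actually sees. Here is a toy sketch of the mismatch involved (the review markup is invented, and the regex approach is deliberately crude):

```python
import re

# Invented review snippet: the visible text shows a 3-star review, while the
# machine-readable rating claims 5 stars to earn a flashier search snippet.
SNIPPET = '<span itemprop="ratingValue" content="5">3 out of 5 stars</span>'

machine_rating = re.search(r'content="([\d.]+)"', SNIPPET).group(1)
visible_rating = re.search(r'>([\d.]+) out of', SNIPPET).group(1)

if machine_rating != visible_rating:
    print(f"Mismatch: markup claims {machine_rating} stars, "
          f"page shows {visible_rating}")
```

A search engine can run consistency checks along these lines, but free-form human-readable pages leave plenty of room for discrepancies that a simple check won’t catch.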
If an AI solution like DiffBot were always 100% accurate, there might not be a tremendous need for semantic markup. But that isn’t the case, at least not yet, and it may take quite some time to get there; some will argue that visual AI will never be fully successful. As technology and creativity on the web evolve, so does the way pages look. The fashion vertical of e-commerce is a prime example, because it shows so much aesthetic variety, and that variety can make it hard for an AI to reliably locate critical pieces of information. For example, I’d say at least 90% of the time the product title appears as the largest content heading on an e-commerce page. But on some high-end fashion sites, like Net-a-Porter, the brand or designer name is set in the largest type, and the product title may be much smaller.
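As a rough illustration of why that heuristic is brittle, here is a hypothetical sketch (both page snippets are invented): a scraper that assumes the most prominent heading is the product title works fine on a conventional layout, but on a designer-first layout it grabs the brand instead.

```python
from html.parser import HTMLParser

class HeadingScanner(HTMLParser):
    """Records the text of each h1-h6 heading along with its level."""
    HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

    def __init__(self):
        super().__init__()
        self._level = None
        self.found = []           # (level, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._level = None

    def handle_data(self, data):
        if self._level and data.strip():
            self.found.append((self._level, data.strip()))

def guess_title(html):
    """Naive heuristic: treat the most prominent (lowest-numbered)
    heading on the page as the product title."""
    scanner = HeadingScanner()
    scanner.feed(html)
    return min(scanner.found)[1] if scanner.found else None

# Conventional product page: the heuristic gets it right.
typical = "<h1>Acme Running Shoe</h1><h2>Acme Sportswear</h2>"
# Designer-first page, in the style of a high-end fashion site:
# the brand gets the biggest heading, so the heuristic returns it.
fashion = "<h1>Maison Exemple</h1><h2>Silk crepe midi dress</h2>"

print(guess_title(typical))   # Acme Running Shoe  (correct)
print(guess_title(fashion))   # Maison Exemple     (wrong: that's the brand)
```

A production extractor would also weigh CSS font sizes, layout position, and surrounding cues, but every added signal is another assumption that some site somewhere violates.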
Small differences like this can throw off even a human, so how will AI fare? Probably with some trouble. In the world of data interpretation and extraction, pulling precise information from a very diversely structured web is hard. The reality is that no system is perfect yet, and both semantic markup and computer vision have obstacles ahead. While the approaches of Hepp and Tung differ, at the end of the day the goal is the same: make web data more useful and accessible. With progress coming from both ends, we are moving in the right direction.
About the Author
This guest post comes from e-commerce entrepreneur Marc Mezzacca. Marc runs a social-media-based coupon code website called CouponFollow, and recently launched an automated coupon notification app, Coupons at Checkout. You can follow him on Twitter.