Sindice Ltd. launched as a startup company this week, complete with a publicly available beta SPARQL endpoint to its indexed and live-updated dataset of some 12 billion triples. Next week, Sindice (which began as a joint academic research project among DERI, the Fondazione Bruno Kessler and OpenLink Software to collect, search, query and build applications on top of semantically marked-up Web data) will deliver formal support for Schema.org.
Sindice, of course, is agnostic when it comes to ingesting semantic markup formats. Supporting new formats is just a matter of syntax adaptation for the service. Whatever format a web site decides to employ — from RDF to RDFa to microformats to microdata — Sindice has coverage of the structured web data and keeps it fresh.
The service opens up vast possibilities for business: As long as a web site structures data in one of these formats, and uses standards like Sitemaps for publishing semantic content, it can become a part of Sindice’s continuously updated repository. It thus becomes a data source for business use, one that can also be joined with other datasets.
Interview with Sindice’s Giovanni Tummarello:
“What Sindice allows is syndication for bits of data” from across multiple web sites, says Dr. Giovanni Tummarello, CEO of Sindice Ltd., who has been R&D lead on the project. “There is a new market where bits of data syndicated across the web will provide value to users but also to whoever provides the markup.”
Sindice in Practice
On the consumption end, think of Sindice as making the “web of data your playground,” as Tummarello puts it. When semantically marked-up data is immediately interconnected, “it is just a matter of asking [of it] the right question. Sindice affords both those things.”
As an example, Sindice shows on its blog how a single query against its SPARQL endpoint can discover, from information spread across multiple marked-up web sources, which individuals on the current DERI team page presented a paper at a recent Linked Data workshop. The same query can also return the title of each paper and its author’s current DERI profile picture. “SPARQL allows in a single query to join different aspects of data coming from different websites,” Tummarello says. “And since Sindice has a live SPARQL endpoint, it’s not just [working with] a static piece of knowledge that’s built and loaded once.”
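To make the idea of a cross-site join concrete, here is a minimal sketch in Python of what such a query could look like. It is not the query from the Sindice blog: the endpoint URL, the two graph URIs and the choice of FOAF and Dublin Core terms are assumptions made for illustration, and the sketch only builds the request URL (the query goes in the `query` parameter, as the SPARQL Protocol allows) without sending it.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and graph URIs, chosen for illustration only.
ENDPOINT = "http://example.org/sparql"
TEAM_GRAPH = "http://example.org/team"
WORKSHOP_GRAPH = "http://example.org/workshop"


def build_query(team_graph: str, workshop_graph: str) -> str:
    """Build a SPARQL query that joins two sources on a person's name.

    Assumes the team page describes people with foaf:name and
    foaf:depiction, and the workshop page describes papers with
    dcterms:title and dcterms:creator.
    """
    return f"""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?name ?title ?picture WHERE {{
  GRAPH <{team_graph}> {{
    ?member foaf:name ?name ;
            foaf:depiction ?picture .
  }}
  GRAPH <{workshop_graph}> {{
    ?paper dcterms:title ?title ;
           dcterms:creator ?author .
    ?author foaf:name ?name .
  }}
}}
""".strip()


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(build_request_url(ENDPOINT, build_query(TEAM_GRAPH, WORKSHOP_GRAPH)))
```

The shared variable ?name is what ties the two sources together: only people who appear on the team page and as a paper author in the workshop data produce a result row.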
Or businesses could create vertical search engines for e-commerce buying or selling services that pull product data for a category from hundreds of websites that use semantic markup, and establish that a product listed on one site under one name is actually the same as one listed elsewhere under a different moniker. They can then classify and compare the same product offerings for best price or other capabilities. Tummarello says that with Sindice it is possible to build such a solution in comparatively less time than it would otherwise take.
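The article does not say how such a service would decide that two listings describe the same product. One simple approach, sketched below in Python purely as an assumption, is to group offers by a shared product identifier such as an EAN code and then pick the cheapest offer in each group. The offer records, site names, identifiers and prices are all invented for the example.

```python
from collections import defaultdict

# Hypothetical offers, as they might look after extraction from marked-up pages.
# Site names, product names, EAN codes and prices are invented.
OFFERS = [
    {"site": "shop-a.example", "name": "Acme Phone X", "ean": "0000000000017", "price": 299.0},
    {"site": "shop-b.example", "name": "ACME X Smartphone", "ean": "0000000000017", "price": 279.5},
    {"site": "shop-c.example", "name": "Acme Tablet 10", "ean": "0000000000024", "price": 199.0},
]


def group_by_ean(offers):
    """Treat offers with the same EAN as the same product, whatever their names."""
    groups = defaultdict(list)
    for offer in offers:
        groups[offer["ean"]].append(offer)
    return dict(groups)


def best_offers(offers):
    """Return the lowest-priced offer for each product identifier."""
    return {
        ean: min(group, key=lambda offer: offer["price"])
        for ean, group in group_by_ean(offers).items()
    }


if __name__ == "__main__":
    for ean, offer in best_offers(OFFERS).items():
        print(ean, offer["site"], offer["price"])
```

Real listings often lack a common identifier, so a production system would likely need fuzzier matching on names and attributes; the sketch only covers the easy case.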
When it comes to e-commerce, the potential of semantic markup for that application “is now firmly recognized by all three major search companies on the planet – after Yahoo (2008) and Google (2010),” says GoodRelations vocabulary inventor and lead developer Prof. Martin Hepp in a press release announcing the launch of Sindice as a startup. “Bing has recently also announced plans to support the GoodRelations standard. Today’s search engines, however, harvest only the tip of the iceberg of this data solely for a better rendering of the search results. Sindice’s technology allows much more sophisticated novel commerce applications.” Hepp is backing Sindice via his Hepp Research semantic data consulting firm.
Speaking of vertical search engines, on the way from Sindice is one for specific domains supported by the Schema.org vocabulary. “But the cool thing about the technology,” Tummarello emphasizes, “is that it is entirely independent. In biotech, for instance, you could build a specific search engine for proteins across thousands of databases that use Linked Data RDF markup, which is particularly popular in the biosciences industry.”
The Schema.org announcement turned out to be well timed for Sindice, as the service will be able to make use of the data immediately, Tummarello says. “On the web I think everyone will do what Google and others are saying because of search engine optimization,” he notes. That said, he is also fairly confident that Schema.org will become more compatible with what already exists in the markup world, including advanced vocabularies that can support more sophisticated use cases.
Some of the challenges that remain for the Web of Data at large revolve around the quality of data markup, which Tummarello says experience shows can still be questionable in some instances. He expects that to change, though, especially as the economic incentives for improving quality grow.
“This is just the beginning, and as scenarios for data syndicated across the web, where bits and pieces of data are coming from different web sites, become valuable for those consuming and producing it, there will be rewards,” he says. Better quality markup by producers means the data will be easier to write queries against and to integrate by consumers to enable their own ends, which ideally should promote a virtuous cycle.
“If there is a proper set of incentives data quality will immediately improve. And things like Sindice that get us to more value in consumption and production are steps forward to having cleaner data on the web.” He also says not to worry excessively about semantic spam and the idea that people will lie when it comes to metadata. Those who do will be punished for it, much as has happened to those who have tried to game search engines to date. “Not every piece of information or markup will be considered. There will be more reputable and less reputable web sites, and those that are kicked off because of lying from their indexes,” he says. “There will be criteria and metrics to measure lying and cheating.”
Sindice is now shopping around for capital to fund its startup phase. In terms of securing investor interest, it can boast, for example, that it has already had requests from enterprises to use its platform to put their own datasets within private semantic spaces, a potential source of revenue. It also hopes to capitalize on its goal of helping people monetize their own data in other ways, whether that’s via fee-based API access, advertising, or enabling a commercial marketplace for data publishers and those who want to integrate that data into their apps.
As Tummarello sees it, there’s no way around getting on board with the Web of Data. “Some in the Semantic Web research community have said that there is no alternative to the Semantic Web, and that is entirely true. There is no alternative to a web where objects are interconnected because of what they really mean,” he says. “It is so much more of an incredibly valuable web than a web of documents.”