Ask a group of Semantic Web professionals where the data should live when you’re doing data integration projects – which is just what Cambridge Semantics VP Lee Feigenbaum, acting in his capacity as co-chair of the W3C's SPARQL Working Group, did at a panel at last week’s SemTech – and don’t expect to get a single, agreed-upon answer.
Among the choices:
“Federation will crush warehousing,” Eric Prud’hommeaux of the W3C and its Semantic Web Health Care and Life Sciences Interest Group said with an eye to provocation. “Leave data where the authorities have it and take advantage of individual domain contributions.” The basic idea of federation is that data stays in its source systems and you do integration dynamically, querying source systems on the fly.
“The data warehousing approach leads to a better user experience or application for people” in most instances, is the conclusion of Cambridge Semantics senior product manager Rob Gonzalez. With data warehousing, an organization takes all the data it wants to integrate, brings it into a single database like an RDF store, and then runs apps against this warehouse.
Michael Lang, CEO, chairman and founder of Revelytix, which publishes tools mostly around the federation problem, split the difference a bit. “It always will be the case that information that business users want is not going to be in a single place. It will be distributed in the enterprise or among partners,” he said, with the caveat that “for certain kinds of analytics queries it is also true that they will only run right if all the data is in one place, at least for now.”
And Kendall Clark, co-founder and managing principal of Clark and Parsia, would like to know what all the fuss is about – at least in the interest of generating the kind of controversial statement every good panel discussion needs. “The primary question we are here to consider doesn’t matter,” he said. “The Semantic Web, RDF, SPARQL are all about not caring about irrelevant low-level details and not tying things to these irrelevant details, like where is the data. That doesn’t matter except for lifecycle analysis or building applications, but from the point of view of those consuming the data these low-level ‘where are the data’ details don’t matter.”
Obviously, however, they did matter enough to take up the allotted time of the panel and post-panel discussions. Gonzalez, for instance, raised his objection to the idea that data locale doesn’t matter, including to end users. “This is not just about a fantasy world where you pull together data whenever, at any time, any speed, and any scale and get the experience end users need,” he said. “End users need to interactively work with data and if you don’t pay attention to where the data resides you leave performance and accountability and compliance and usage issues on the table or in an inconsistent state.” In Gonzalez’ view, the federation approach introduces problems on a number of fronts: performance issues when ad hoc queries go against random databases run by administrators who have not indexed them for those queries; less predictability of downtime of the various systems in the federation; and possibly inaccurate results when big systems haven’t been uniformly updated. “It’s harder to make federation really work than warehousing,” he said.
Federation has advantages the warehouse approach lacks, its proponents say. Prud’hommeaux, for instance, pointed out that you don’t have to take data like medical information, handcuff it to your wrist, and agree to being responsible for it while you have it and to throw it away when you’re done, “which is hard when you’re doing research and you want people to see the results.” There are issues to solve, but they are solvable, the proponents said. Governance of data and information must radically transform in organizations where each business unit owns and controls its own data, the policies regarding it (including security), and its business uses. “The change is simple. Your compensation as a manager will change unless you work out governance policies so your data is shared well for the benefit of shareholders,” Lang said. “Everything happens in compensation.”
To clarify his point, Clark noted later that the reason he made the statement is that the “mojo of the Semantic Web is to rarely put customers in a bind where we say that you have to do it this way. So we should not distort, at least at the standards level, forcing this conversation to be resolved one way or another.”