[Editor’s Note: At the Semantic Web Summit conference in Boston in November, a discussion arose around Federated Data vs. Data Warehousing. Rob Gonzalez of Cambridge Semantics raised some very interesting points that I asked him to expand on in the post below. And whether you agree or disagree, we want to hear from you.
The Semantic Web dream of data federation is awesome. You type in a query, and magical, intelligent agents scurry all over the datasphere, bringing back information to give you a complete, up-to-date, correct answer to your question. No need for a messy, time-consuming datamart project! What’s not to love, right?
Eric asked me to write this piece, and so I find myself in the unenviable position of having to tell you, dear reader and Semantic Web fan, that there is no Santa Claus (of data federation). I’d like to make a case for the continued need for data consolidation in datamarts—yes, even in the Semantic Web-world—to gain real value from your enterprise data.
While there are many specific technical things I could address, I’ll stick to the two that are both simple and damning: Performance and Query Functionality. I’ll also address the supposed “data freshness advantage” that federation proponents preach.
In the Semantic Web world, it’s very easy to express intelligent questions. At Cambridge Semantics we built a demo last year that lets you ask, “Did the seniority of your senators affect how much Obama money your state received?” and basically any other question you might want to ask about the American Recovery and Reinvestment Act. This kind of application requires a database that is both more flexible than traditional databases and capable of dealing with semi-structured, jagged data, which is, of course, where the Semantic Web’s flexible data model, RDF, steps in.
In a federated world you’re trying to answer these kinds of questions by querying traditional databases during query processing. In effect you’re asking completely unanticipated questions of traditional databases, which were not designed to handle unanticipated questions with any sort of performance guarantee.
In 2010, Oracle DBAs make $100,000+. They don’t make this money because Oracle is a particularly hard RDBMS to manage. They can demand it, at least in part, because performance tuning is hard. When a system faces increasing user loads on growing datasets over time, and is kept on the same old hardware, much must be done to keep its performance acceptable.
If it’s hard enough to get a high-powered database to be fast at solving the problems it was designed to solve, how do you expect it to handle the increasingly sophisticated queries made possible in Semantic Web interfaces (especially if you consider how easy it is to express complex JOIN logic with SPARQL)? What you’ll end up with is stressed-out DBAs tuning for long-running queries for which their systems were not designed, and which may never be run again!
This also makes the bold assumption that you’re lucky enough to have a set of throw-caution-to-the-wind DBAs manning the source systems who will let you run any old random query against their production databases during peak hours. Most DBAs are happy to provide you with data dumps on a schedule, and maybe with a way to query for updates more regularly, but will absolutely not let you put their already strained transactional systems under additional load from your ad hoc queries.
Data federation solutions also carry the risk that any single system under federation could go down. The federated query must then ignore the fallen system (which means that it will return variable results depending on which source systems are online), wait for the fallen system to come back online, or fail.
What’s worse, a source database for your federated solution doesn’t even have to be down to hurt the performance of the federated queries. There are periodic times during the course of a day when transactional systems are under heavy load, are being backed up, suffer temporary network slowdowns (if your data exists in several databases over a Wide Area Network), or run into any number of other such common realities. If your federated query is sent over a reasonably large number of systems, the chance that at least one system either is slow to respond or doesn’t respond at all to a particular query can be unreasonably high.
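To put rough numbers on that last point, here is a minimal sketch. It assumes each source system is slow or unavailable independently of the others with the same probability; both the independence assumption and the 2 percent figure are illustrative choices of mine, not measurements from any real deployment.

```python
def chance_of_degraded_query(num_sources: int, p_source_bad: float) -> float:
    """Probability that at least one of num_sources is slow or down.

    Assumes each source misbehaves independently with probability
    p_source_bad (an illustrative simplification).
    """
    if not 0.0 <= p_source_bad <= 1.0:
        raise ValueError("p_source_bad must be between 0 and 1")
    if num_sources < 0:
        raise ValueError("num_sources must not be negative")
    # The query is healthy only if every single source is healthy.
    return 1.0 - (1.0 - p_source_bad) ** num_sources


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(chance_of_degraded_query(n, 0.02), 3))
```

Under these assumptions, a query fanned out to twenty sources hits at least one slow or missing system about one time in three, even though each source on its own looks 98 percent healthy.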
In contrast, with a datamart you pull data dumps at scheduled times into a centralized location that can be tuned for queries. Said another way, instead of doing any and all data integration and transformation on the fly, you’re doing it all up front, and then mostly performing calculations, and not integrations and transformations, when queries come in. I just don’t see how the two solutions can really compare when performance is needed, which is the case in every single interactive application that you expect to build.
Query Functionality Gaps: Separate is Unequal
There are fundamental query operations that a federated query cannot do, but that can be accomplished with a datamart. Rather than get into a fancy quantitative analytic example I’ll instead talk about the boring old Computer-Science-101 problem of sorting.
First, I’d like you to go to Amazon.com and search for “data federation.” It’s cool; I’ll be here when you get back. I got 239 results when I did it, which is too many to flip through without sorting. If you do try to sort the results you’ll notice this problem:
Why, Amazon? Can’t you just show me the cheapest of 239 things? Surely that’s not too many to sort using the famously massive computing resources you can bring to bear!
For individual sort keys that are universal, you can simply have each federated source return its own first page of sorted results and merge those to build the first page. But compound sort keys that require data from multiple data sources are another matter. Suppose you want to sort partly by a product’s price, which would be in one system, and partly by its sales volume, which might be in another system, and you combine the two numbers by multiplying them or through some other machination. You then have no option other than to bring all the sorting data for all possible objects that match the query into one place, calculate the compound sort key, sort, and then re-query the source systems!
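The compound-key case can be sketched in a few lines of Python. Everything here is hypothetical: the two dictionaries stand in for separate source systems, the product IDs and numbers are made up, and price times sales volume is just one possible compound key.

```python
# Hypothetical stand-ins for two separate source systems. In a real
# federation each lookup would be a remote query, not a dict access.
PRICES = {"a": 100.0, "b": 10.0, "c": 40.0}
SALES_VOLUME = {"a": 1, "b": 100, "c": 40}


def federated_first_page(matching_ids, page_size):
    """Return (first page of IDs by price * volume, rows pulled from sources)."""
    # Step 1: pull the sort data for EVERY match from BOTH systems.
    price_rows = {pid: PRICES[pid] for pid in matching_ids}
    volume_rows = {pid: SALES_VOLUME[pid] for pid in matching_ids}
    rows_transferred = len(price_rows) + len(volume_rows)

    # Step 2: compute the compound key in one central place.
    keyed = [(price_rows[pid] * volume_rows[pid], pid) for pid in matching_ids]

    # Step 3: sort on the compound key, highest first.
    keyed.sort(reverse=True)

    # Step 4 (not shown): re-query the sources for the display details
    # of just the IDs on this page.
    page_ids = [pid for _, pid in keyed[:page_size]]
    return page_ids, rows_transferred


page, moved = federated_first_page(["a", "b", "c"], page_size=1)
print(page, moved)
```

With this toy data the top item by compound key is "c", yet "c" is the top item in neither source on its own ("a" has the highest price, "b" the highest volume). Asking each source for its own first page would therefore miss it, which is why all of the sort data has to move.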
So one reason that Amazon might be forcing you into a specific department is that their proprietary relevance sort might rely on such a key, but that’s just speculation. What is undeniable, however, is that in corporate situations compound or calculated sort keys occur all the time.
If you have a lot of data, it is not practical to compute compound sort keys for every query. You bog down your network, your source data systems, and whatever poor computer is trying to pull everything together on the fly.
So data federation doesn’t even let you sort things effectively, and that’s just the tip of the iceberg. The reality is that there are many examples of query functionality that require re-querying, bringing massive subsets of data from source systems into one place, or other machinations that are completely impractical for data federation. You can’t really sort, and you can’t do lots of other things you want to do either.
(To those who say, “But we’ll just allow simple sorting!” I challenge you to keep the scope of any application you build limited in that way for its lifetime…and if you do think you can plan out your project so concretely, why bother with Semantics anyway, except to maybe publish the data in the system?)
“So what?” the data federation enthusiast might say. “For the queries I can answer, I’m going to give a more up-to-date answer than your fancy warehouse, which has to wait for its periodic updates! And in this Twitter-crazed, 24-hour-news-network, text-happy world we’re in, Fresh is King!”
I understand this argument, but I don’t understand why federation advocates consider federation the best way to provide it. People can and do build ETL pipelines all the time that update downstream systems when a source system’s data changes. High-end products like Informatica have been doing this for well over a decade, and today can ensure real-time data freshness. If fresh is really what you care about, then you can absolutely get it with a datamart.
So then the last thing federation advocates would push is speed to solution. As a reader of this site you know that semantics enables rapid data integration much more readily than traditional technology. For that reason I believe that semantic datamarts don’t have anything like the cost associated with traditional datamarts, because RDF enables new data to be integrated without having to modify the fundamental data model and system design, enabling incremental development of a datamart. This seriously mitigates the time-to-market advantage that a non-semantic data federation solution might be able to claim over a non-semantic datamart.
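As a rough illustration of why new data can be folded in without reworking the data model, here is a toy sketch in plain Python that stores facts as triples. It is not a real RDF store, and every identifier and value in it is made up for the example.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All identifiers and values below are hypothetical.
mart = {
    ("senator:1", "represents", "state:EX"),
    ("senator:1", "yearsInOffice", 25),
}


def add_source(store, new_triples):
    """Merge triples from a newly integrated source into the store in place."""
    store.update(new_triples)


def objects(store, subject, predicate):
    """Return every object recorded for a given subject and predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}


# A later source introduces a predicate the mart has never seen.
# No table is altered and no existing triple is touched.
add_source(mart, [("state:EX", "recoveryFunds", 1_000_000)])

print(objects(mart, "state:EX", "recoveryFunds"))
print(objects(mart, "senator:1", "yearsInOffice"))
```

In a relational datamart the same change would typically mean a new column or table plus the migration that goes with it; here the new source simply contributes more triples, and existing queries keep working.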
An Exception: Network Security in Intelligence
I’d like to call out one specific case in which building a datamart is impossible.
In certain security contexts, information is queryable but not storable. Clearly building a datamart under such conditions is impossible, so companies turn to federation as an alternative. I’ve seen this both in the federal government and in the private sector with defense contractors.
This is not at all to say that semantically linking datasets isn’t valuable. On the contrary! I believe that giving old, weather-beaten databases a coat of semantic paint is awesomely valuable. It makes creating ETL pipelines that bring together data from all kinds of locations a breeze as compared to traditional, relationally-oriented ETL pipelines. It’s hardly even fair to compare the two approaches, except insofar as the maturity of the traditional technologies is concerned, and I’ll try to pick up on specific reasons for this belief in future posts. In fact, I see semantics as enabling on-demand datamarts in ways that traditional data integration technologies simply have failed to do.
For now, however, I believe most projects will better serve their customers by going through the exercise to create a semantic datamart or semantic data warehouse than by following the fool’s gold promised by enterprise data federation solutions.
For another perspective, see The Federated Enterprise (Using Semantic Technology Standards to Federate Information and to Enable Emergent Analytics) by Michael Lang, Revelytix.]
Photo Courtesy: Flickr/Ken Lund