Down with the Data Warehouse! Long Live the Semantic Data Warehouse!

September 30, 2011

I had a call with a Fortune 100 IT team that is looking at semantic technology as an alternative to the data warehouse. This is my favorite kind of conversation, since I firmly believe the traditional data warehouse is dead but just doesn’t know it yet.

This is the situation the IT team explained:

We need to aggregate information and present it to the user, so we build a warehouse.  We spend all this time building and designing the warehouse, and when it’s done they need something else.  Unfortunately, it’s not so easy to modify a warehouse once it’s running, so we build another one.  And then another.  The cycle has been repeating itself for years and is not sustainable.


The alternative to warehousing that they, and many others, see is Data Virtualization (also called EII or Data Federation; there are lots of terms for it). Essentially, they have been burned by years of working with an inflexible technology, so they are looking to dump the approach altogether.

I get this. If a durian is the only fruit you’ve ever smelled, you’d think all fruit was really stinky.

Frankly, what they’re looking at is a sucker’s choice: either you take the warehouse and all of its evil, or you try Federation and all of its evil.  No middle ground.

In a previous post, “I’ve Got a Federated Bridge to Sell You (A Defense of the Warehouse),” I defended the Semantic Data Warehouse. In this post, I will continue to do so. There, I claimed the following:

This is not at all to say that semantically linking datasets isn’t valuable.  On the contrary!  I believe that coating old, weather-beaten databases with a coat of semantic paint is awesomely valuable…In fact, I see semantics as enabling on-demand datamarts in ways that traditional data integration technologies simply have failed to do.

I firmly believe that the underlying cause is the inflexible, tabular, RDBMS data model.  Whether you use a Warehouse or Data Virtualization, if you use those old techniques invented in 1970 for 1970s hardware and 1970s computer problems, you will find yourself with an inflexible system.  Period.

Much has been written on this blog and elsewhere about how Semantic Web technologies start with the Open World Assumption (OWA) and with a flexible data model (see both Feigenbaum’s and Mike Bergman’s recent, wonderful posts about this). The OWA, in layman’s terms, is the assumption that you never have all the facts: a statement that is missing from your data is unknown, not false.
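To make the difference concrete, here is a minimal sketch in plain Python. It is not a real RDF store; the company names and the two helper functions are made up for illustration. It only shows how the same missing fact is answered under a closed-world reading (the habit of relational systems) and under an open-world reading.

```python
# Toy knowledge base: a set of (subject, predicate, object) triples.
# All names here are hypothetical and exist only for this illustration.
FACTS = {
    ("acme", "hasCustomer", "globex"),
    ("acme", "locatedIn", "Boston"),
}


def ask_closed_world(triple):
    """Closed world: anything that is not recorded is treated as false."""
    return triple in FACTS


def ask_open_world(triple):
    """Open world: a missing statement is unknown (None), not false."""
    return True if triple in FACTS else None


known = ("acme", "hasCustomer", "globex")
missing = ("acme", "hasCustomer", "initech")

print(ask_closed_world(known), ask_open_world(known))
print(ask_closed_world(missing), ask_open_world(missing))
```

Under the closed-world reading the system asserts that Initech is not a customer. Under the open-world reading it only says that nobody has told it so yet, which leaves room for another dataset to supply the fact later.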

This combination is a fundamental departure from traditional systems, and I cannot possibly underscore that enough.

Old School, Inflexible Database Design


Think about it: with a relational database (including relational data warehouses, and even OLAP tools), the technology demands that you create a schema up front. The first step in building one is to gather requirements. Then you build a conceptual model. Then a logical model. At some point later, after the critical problem has already grown a little stale, you implement the thing.

The reason for all this up-front work is that once you have implemented the model, changing it is not cheap! You have to go through the whole process again, and sometimes it’s easier to set up a new warehouse or data mart than to fix the initial one.
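A small sketch of the schema-first constraint, using Python’s built-in sqlite3 module. The customer table and the region column are invented for the example. In this toy the fix is a single ALTER TABLE; the assumption behind the argument here is that in a real warehouse the same change also ripples through load jobs, reports, and sign-offs.

```python
import sqlite3

# Hypothetical up-front schema: the design only anticipated id and name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Globex')")


def try_insert_with_region(connection):
    """Try to store a fact the original schema did not plan for."""
    try:
        connection.execute(
            "INSERT INTO customer (id, name, region) "
            "VALUES (2, 'Initech', 'EMEA')"
        )
        return True
    except sqlite3.OperationalError:
        # SQLite rejects the statement: the table has no such column.
        return False


rejected_before = not try_insert_with_region(conn)

# The schema has to change before the new fact can be recorded.
conn.execute("ALTER TABLE customer ADD COLUMN region TEXT")
accepted_after = try_insert_with_region(conn)

rows = conn.execute(
    "SELECT id, name, region FROM customer ORDER BY id"
).fetchall()
print(rejected_before, accepted_after, rows)
```

Note that the existing row now carries a NULL region: the table shape forces a value slot for every row, whether or not anything is known.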

The biggest time waster here is the coordination involved. Since it’s expensive to change the system, you need everyone’s buy-in. So you have all these meetings, but someone’s always on vacation or had to cancel at the last minute for a client engagement, and so you lose yet another week in the process. Eventually, sometimes years later, your data warehouse project goes live, when it’s no longer relevant.

With a Semantic Data Warehouse, there is a fundamental assumption that the schema is never finished. It evolves. You can have different groups working on different parts of the problem and integrate their pieces as needed later on! This is Cooperation without Coordination. It enables small, nimble teams to work towards a big goal without stepping on each other’s toes. And it’s possible because RDF doesn’t require that you know your whole schema in advance. It doesn’t require that a table has 5 columns and the 4th column is a VARCHAR with 256 characters or anything equally inane and artificial. It is resilient to NULLs and dirty data and schema changes. In short, Semantic Web technologies address the major problem with data warehouses, and with current database inflexibility in general.
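Here is a rough sketch of Cooperation without Coordination, with plain Python sets standing in for an RDF store. The team names, identifiers, and predicates are all hypothetical, and a real system would use an RDF library and SPARQL rather than a set of tuples. The point is only the shape of the data: everything is a (subject, predicate, object) triple, so merging is a union and a new property is just one more triple.

```python
# Each team publishes (subject, predicate, object) triples on its own schedule.
# All identifiers and predicates below are made up for illustration.
sales_team = {
    ("customer:1", "name", "Globex"),
    ("customer:1", "region", "EMEA"),
}
support_team = {
    ("customer:1", "openTickets", 3),
    # No region is known for customer:2, so there is simply no triple (no NULL).
    ("customer:2", "name", "Initech"),
}

# Integration is a set union; neither team agreed on a table layout first.
warehouse = sales_team | support_team

# A property nobody planned for is just one more triple.
warehouse.add(("customer:2", "twitterHandle", "@initech"))


def describe(graph, subject):
    """Collect what is currently known about one subject.

    Simplification: this keeps one value per predicate, which is enough here.
    """
    return {p: o for s, p, o in graph if s == subject}


print(describe(warehouse, "customer:1"))
print(describe(warehouse, "customer:2"))
```

Nothing had to be altered or migrated when the Twitter handle showed up, and the missing region for the second customer is an absence of data rather than a placeholder value.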

So why not a Semantic Data Warehouse?  As before, I still don’t fully trust the Federated alternative due to network performance and other issues.

So if you’ve been burned by Data Warehouse projects in the past, you will be burned again by relational EII technology.  It’s going to have the same limitations on flexibility, and is going to require even greater maintenance costs because there are more moving pieces than with a warehouse.  You will be burned again.

It’s not about the Warehouse or Federation.  It’s about a dated, inflexible model.  That is why semantic technologies matter.  With that flexibility you can do more, faster.
