My NoSQL/Big Data Epiphany

by John Biderman

Let’s face it: the whole NoSQL/Big Data thing has traditional data management professionals (in whose ranks I count myself) at best concerned, at worst terrified.  Several of our most closely held shibboleths are being sundered.  Where’s the data model?  Where are the keys, the constraints?  How do I enforce data quality? How does it conform to ACID?  Who are these people saying the relational database is dead?

I’ve been paying a modicum of attention to the NoSQL discussions over the last few years, trying to wrap my head around what it means to us.  My education continued at Enterprise Data World this year, starting with another excellent presentation by NoSQL evangelist and co-founder of the NoSQL Now! Conference, Dan McCreary (an older version of one of his PowerPoints can be found on his web site), followed by some other presentations and panels where things like Hadoop and Mongo and Cloudera got increasingly demystified for me (and, in one lightning session, hilariously lampooned by Karen Lopez, which she shared here).  After a few years of learning and thinking about this stuff, including multiple listenings to Dan’s expositions, I finally had my ah-ha moment — or moments.

We can all relax.  First of all, relational databases aren’t going away.  The fact is that systems built on relational databases and those on NoSQL are working in two different problem spaces and will likely coexist for a long time.  And as for relations and constraints and data quality, my really big ah-ha was that in the NoSQL/Big Data world it is all about quantity, not quality, and about speed.  The role of the NoSQL technologies is to absorb large quantities of data rapidly without any preconception as to its structure and without any notions as to its quality (since NoSQL systems are usually recipients of data originating elsewhere and merely reflect the quality of the originating source).  The reason quality doesn’t matter?  It’s because the data will be used to perform aggregate analysis, bunching data together to look, for example, at patterns of an individual (consumer, patient, web surfer, social networker) or a large cohort, in the context of which bad data can either be tossed aside (if you can detect that it looks suspicious) or will be statistically unimportant when included in large aggregations.  Volume, speed (both of data acquisition and retrieval, because the technologies are built for parallel processing and everything gets indexed), and schema neutrality (which I think is a better term than the misleading “schema-less”) are the operative concepts.
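The point that a few bad records become statistically unimportant in a large aggregation can be checked with simple arithmetic. What follows is only a rough sketch: the ages, the five bad values, and the plausibility range are made-up assumptions for illustration, not data from any real system.

```python
# Hypothetical data: 100,000 plausible ages (20 to 79) plus five bad records.
clean_ages = [20 + (i % 60) for i in range(100_000)]
bad_ages = [-1, 0, 450, 999, 1200]  # assumed junk values from an upstream source
all_ages = clean_ages + bad_ages


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def plausible(age):
    """Assumed sanity rule: keep ages strictly between 0 and 120."""
    return 0 < age < 120


# Option 1: include the bad records and let the volume dilute them.
mean_clean = mean(clean_ages)
mean_with_bad = mean(all_ages)
shift = abs(mean_with_bad - mean_clean)

# Option 2: toss aside records that look suspicious before aggregating.
filtered_ages = [age for age in all_ages if plausible(age)]

print(f"mean of clean data:       {mean_clean:.3f}")
print(f"mean including bad rows:  {mean_with_bad:.3f}")
print(f"shift caused by bad rows: {shift:.3f}")
print(f"rows dropped by filter:   {len(all_ages) - len(filtered_ages)}")
```

This only holds when the bad records are few and scattered; a source that is wrong in the same direction on every row would not wash out in the aggregate.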

By contrast, we will continue to have systems in which you need to be very certain that a row of data is valid at the moment it is inserted into the database, or else you will face highly negative consequences; one could conjure an almost infinite number of scenarios where this is important.  For such applications, the relational paradigm with its structure and constraints (its beneficial rigidity, we could say) will continue to be a feasible approach.

With my newfound comfort that the NoSQL technologies are complementing, not necessarily supplanting, our traditional systems, I got to thinking about applications of the technology within our enterprise.  One place with potentially huge benefits would be data warehousing, for a couple of reasons.  First, integrating new data would not require that all the rigid schema and ETL be pre-specified for the warehouse, which is probably the biggest time and money sump in warehousing today (a point Geoff Malafsky, founder of Phasic Systems, has been hammering home at conferences and in online discussions).  Second, NoSQL systems offer relatively easily scaled performance and, it should be noted, an easy port to the cloud.

But NoSQL systems, from what I’ve gleaned, are not terribly efficient forms of data storage.  In a relational system we define the metadata once, in the DDL, and the rest is all data, which the system understands to conform to the structure defined in the metadata.  In NoSQL systems, the metadata are described alongside almost each and every datum.  For systems storing large objects like photographs, documents, or videos, the volume of data greatly exceeds the volume of metadata, but it is perfectly possible to have systems where the metadata dwarfs the actual data.  In the era of cheap massive storage maybe this is not such an issue, but try telling that to the guys managing the storage arrays at my company.
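To get a feel for how metadata can dwarf data, here is a rough sketch that serializes the same records two ways: once with the field names written a single time (a CSV header standing in for the DDL) and once as self-describing JSON documents that repeat the field names in every record. The records and field names are hypothetical, and CSV and JSON text are only stand-ins; real relational engines and document stores use their own binary formats and compression, so the actual ratio will differ.

```python
import csv
import io
import json

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"member_id": i, "plan_code": "A1", "is_active": True}
    for i in range(1000)
]


def schema_once_size(rows):
    """Bytes used when field names are written once (header row plus values)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def self_describing_size(rows):
    """Bytes used when every record carries its own field names (JSON documents)."""
    return sum(len(json.dumps(row).encode("utf-8")) for row in rows)


tabular_bytes = schema_once_size(records)
document_bytes = self_describing_size(records)

print(f"schema defined once:     {tabular_bytes} bytes")
print(f"self-describing records: {document_bytes} bytes")
print(f"ratio: {document_bytes / tabular_bytes:.1f}x")
```

With short values like these, the repeated field names account for most of each document. With large payloads such as photographs or videos, the same overhead would be negligible.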

The other gray area for me, especially in corporate data warehousing, is the important problem of how we capture and describe the business metadata of a NoSQL system.  This is something we’ve gotten pretty good at over the last 15 years when we start with our logical and physical model representations and attach definitions to each entity or attribute.  In a paradigm in which you effectively acquire the data first and describe it later, how will that work?  Please post a comment if you have thoughts on this.  It will be the next phase of my NoSQL education… now that I’ve gotten past the anxieties.


John Biderman

John Biderman has over 20 years of experience in application development, database modeling, systems integration, and enterprise information architecture. He has consulted to Fortune 500 clients in the US, UK, and Asia. At Harvard Pilgrim Health Care (a New England-based not-for-profit health plan) he works in the areas of data architecture standards and policies, data integration, logical data modeling, enterprise SOA message architecture, metadata capture, data quality interventions, engaging the business in data stewardship processes, and project leadership. 

  4 comments for “My NoSQL/Big Data Epiphany”

  1. May 21, 2012 at 10:34 am

    I’m with you here, John. The real point to my rant was that while we can try to judge NoSQL solutions against relational technologies, there’s no reason to do so. And, by not so much of a coincidence, there is no reason to judge relational against post-relational solutions.

    When I hear people saying that “relational is dead”, I translate that into “relational is dead for some problem sets”. I also take this “end of the relational world” ranting to be overt sales pitches to put fear into the minds of CIOs. My rule of thumb is that if a marketing pitch consists only of dissing other technologies, then the marketer is showing that he or she really has no message of strength for his or her own solution.

    Dan gets it right in his course and presentations. It’s Not-Only-SQL. I’m good with that. In fact, I think it’s great when we find solutions with a better fit to a task.

  2. John Biderman
    May 21, 2012 at 10:39 am

    Thanks for the comment, Karen. I have a follow-up question that I actually posted as a comment after your blog, cross-referenced here:

    http://www.dataversity.net/size-doesnt-matter-or-does-it-a-rant-on-big-data-terms/#comments

  3. May 24, 2012 at 8:41 am

    Great post John!

    I think you should also give yourself some credit for the groundbreaking work you did using NoSQL Wiki technologies to manage complex healthcare metadata. The social aspect of metadata is critical also in the NoSQL world. Your presentations also made me think about how to integrate the best of Wiki systems into the NoSQL world.

    Your post has some very quotable phrases. Please continue!
