Let's face it: the whole NoSQL/Big Data thing has traditional data management professionals (in whose ranks I count myself) at best concerned, at worst terrified. Several of our most closely held shibboleths are being sundered. Where's the data model? Where are the keys, the constraints? How do I enforce data quality? How does it conform to ACID? Who are these people saying the relational database is dead?
I've been paying a modicum of attention to the NoSQL discussions over the last few years, trying to wrap my head around what it means to us. My education continued at Enterprise Data World this year, starting with another excellent presentation by NoSQL evangelist and co-founder of the NoSQL Now! Conference, Dan McCreary (an older version of one of his PowerPoints can be found on his web site), followed by some other presentations and panels where things like Hadoop and Mongo and Cloudera got increasingly demystified for me (and, in one lightning session, hilariously lampooned by Karen Lopez, which she shared here). After a few years of learning and thinking about this stuff, including multiple listenings to Dan's expositions, I finally had my ah-ha moment -- or moments.
We can all relax. First of all, relational databases aren't going away. The fact is that systems built on relational databases and those built on NoSQL are working in two different problem spaces and will likely coexist for a long time. And as for relations and constraints and data quality, my really big ah-ha was that the NoSQL/Big Data world is all about quantity and speed, not quality. The role of the NoSQL technologies is to absorb large quantities of data rapidly, without any preconception as to its structure and without any notions as to its quality (since NoSQL systems are usually recipients of data originating elsewhere and merely reflect the quality of the originating source). The reason quality doesn't matter? The data will be used for aggregate analysis, bunching data together to look, for example, at patterns of an individual (consumer, patient, web surfer, social networker) or of a large cohort. In that context bad data can either be tossed aside (if you can detect that it looks suspicious) or will be statistically unimportant when included in large aggregations. Volume, speed (of both data acquisition and retrieval, because the technologies are built for parallel processing and everything gets indexed), and schema neutrality (which I think is a better term than the misleading “schema-less”) are the operative concepts.
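To make the "statistically unimportant" point concrete for myself, here is a small sketch in Python. Everything in it is invented for illustration: the synthetic values, the one-in-a-thousand corruption rate, and the `plausible` range check are my assumptions, not anything taken from a real NoSQL system.

```python
# Hypothetical illustration: a synthetic feed of 100,000 "age" values,
# one in every thousand of which arrives corrupted as an impossible 0.
N = 100_000
clean = [40 + (i % 21) for i in range(N)]  # plausible values between 40 and 60
noisy = [0 if i % 1000 == 0 else v for i, v in enumerate(clean)]


def plausible(value, low=1, high=120):
    """Assumed sanity check: keep only values inside a believable range."""
    return low <= value <= high


def average(values):
    return sum(values) / len(values)


# Option 1: toss aside anything that looks suspicious before aggregating.
filtered = [v for v in noisy if plausible(v)]

clean_mean = average(clean)
noisy_mean = average(noisy)        # bad rows left in
filtered_mean = average(filtered)  # bad rows tossed aside

print(f"clean mean:    {clean_mean:.3f}")
print(f"noisy mean:    {noisy_mean:.3f}")
print(f"filtered mean: {filtered_mean:.3f}")
print(f"rows dropped:  {len(noisy) - len(filtered)}")
```

Leaving the hundred bad rows in moves the average only slightly, and filtering them out recovers it almost exactly. The same would not hold for a statistic driven by a handful of rows, which is why this argument only applies to large aggregations.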
By contrast, we will continue to have systems in which you need to be very certain that a row of data is valid at the moment it is inserted into the database, or else suffer highly negative consequences; one could conjure an almost infinite number of scenarios where this is important. For such applications, the relational paradigm with its structure and constraints (its beneficial rigidity, we could say) will continue to be a sound approach.
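As a minimal sketch of that beneficial rigidity, the Python snippet below uses the standard library's `sqlite3` module with an in-memory database. The `account` and `payment` tables, their columns, and the `try_insert` helper are hypothetical names I made up for the example; the point is only that the database itself refuses an invalid row at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: the rules are declared once, in the DDL.
conn.execute(
    "CREATE TABLE account ("
    " account_id INTEGER PRIMARY KEY,"
    " owner TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE payment ("
    " payment_id INTEGER PRIMARY KEY,"
    " account_id INTEGER NOT NULL REFERENCES account(account_id),"
    " amount_cents INTEGER NOT NULL CHECK (amount_cents > 0))"
)
conn.execute("INSERT INTO account (account_id, owner) VALUES (1, 'Alice')")
conn.commit()


def try_insert(account_id, amount_cents):
    """Attempt one payment insert and report whether the database took it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment (account_id, amount_cents) VALUES (?, ?)",
                (account_id, amount_cents),
            )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert(1, 2500),   # valid row
    try_insert(1, -5),     # violates the CHECK constraint
    try_insert(99, 100),   # no such account: violates the foreign key
]
for outcome in results:
    print(outcome)

stored_rows = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
print("rows stored:", stored_rows)
```

Only the valid payment ends up in the table. The two bad rows never get in, which is exactly the guarantee a schema-neutral store does not try to give you.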
With my newfound comfort that the NoSQL technologies are complementing, not necessarily supplanting, our traditional systems, I got to thinking about applications of the technology within our enterprise. One place with potentially huge benefits would be data warehousing, for a couple of reasons. First, integrating new data would not require that all the rigid schema and ETL be pre-specified for the warehouse, which is probably the biggest time and money sump in warehousing today (a point Geoff Malafsky, founder of Phasic Systems, has been hammering home at conferences and in online discussions). Second, these systems offer relatively easily scaled performance and, it should be noted, an easy port to the cloud.
But NoSQL systems, from what I’ve gleaned, are not terribly efficient forms of data storage. In a relational system we define the metadata once, in the DDL, and the rest is all data which the system understands to conform to the structure defined in the metadata. In NoSQL systems, the metadata are described alongside almost every datum. For systems storing large objects like photographs, documents, or videos, the volume of data greatly exceeds the metadata, but it is perfectly possible to have systems where the volume of metadata dwarfs the actual data. In the era of cheap massive storage maybe this is not such an issue, but try telling that to the guys managing the storage arrays at my company.
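A rough way to see the overhead is to serialize the same small records two ways: once as self-describing JSON documents, where every record carries its own field names, and once as CSV, where the column names appear a single time in the header. This is only a sketch under my own assumptions. The `readings` records and their field names are invented, and real document stores use their own encodings and often compression, so actual ratios will differ.

```python
import csv
import io
import json

# Hypothetical records: three tiny values per reading.
FIELDS = ["sensor_id", "temperature_c", "is_valid"]
readings = [
    {"sensor_id": i % 10, "temperature_c": 20 + (i % 5), "is_valid": True}
    for i in range(1000)
]

# Document style: each record repeats its field names (the metadata).
doc_bytes = sum(len(json.dumps(r).encode("utf-8")) for r in readings)

# Table style: field names are written once, then only the values follow.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(readings)
table_bytes = len(buffer.getvalue().encode("utf-8"))

print(f"self-describing documents: {doc_bytes} bytes")
print(f"header plus rows:          {table_bytes} bytes")
print(f"ratio:                     {doc_bytes / table_bytes:.1f}x")
```

With values this small, the repeated field names account for most of each document. Swap the three tiny values for a photograph or a video and the proportion flips, which is the distinction drawn above.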
The other gray area for me, especially in corporate data warehousing, is the important problem of how we capture and describe the business metadata of a NoSQL system. This is something we’ve gotten pretty good at over the last 15 years, by starting with our logical and physical model representations and attaching definitions to each entity and attribute. In a paradigm in which you effectively acquire the data first and describe it later, how will that work? Please post a comment if you have thoughts on this. It will be the next phase of my NoSQL education… now that I’ve gotten past the anxieties.