By Karen Lopez
Last week at Enterprise Data World, I gave a lightning talk (a strictly enforced, time-limited, five-minute presentation). Since I can barely manage to fit meaningful presentations into one hour given how much I talk, I decided to go for a rant. That may not come as a surprise to you. If it does, welcome, new reader.
Last year I also gave a lightning talk on the Myths of Normalization, which I’ve turned into a blog series here, too. That was also a rant. I’m seeing a pattern here. You should probably envision me holding a martini and my tablet while I ranted. It will put you in the mood. I didn’t actually hold a martini in my hand. It was on the table beside me.
Since I got good feedback, I decided to share my script for my talk here. By the way, those of you snickering about this rant, know that I’m working on a similar one for the RDBMS area as we speak. Look for it at a NoSQL or Big Data event near you.
Size Doesn’t Matter. Or Does It?
I’m @Datachick. I think a lot about data. Today I’m on a rant. I know. SHOCKING!!!
I’m a huge fan of Big Data and NoSQL. Really. A really, really big fan. Get it? BIG DATA. Today I want to share with you some of my more snarky observations about BIG DATA. By the way, every single one of these rants is totally unfair, cherry-picked and irreverent. I know. It’s shocking.
Let’s start with the basics: What is Big Data? I’m here to tell you that nobody really knows. The good thing about Big Data is just that. So it can be anything you want it to be. Really. Just like that nice friendly woman who wanted you to buy her a drink last night.
Here’s a nice definition from Wikipedia, the ultimate source of knowledge for the human race. But that’s a whole ‘nother rant.
In information technology, big data consists of data sets that grow so large that they become awkward to work with(1)
What the heck kinda definition is that? Data that’s so big it’s awkward? I can’t wait to be in that meeting with the CEO, CIO and friends.
Big Data vs. data
One of the things I noticed about Big Data is that it is always capitalized when it’s written. I’m not sure why, because it really isn’t a proper noun. We don’t capitalize DATA, so why should Big Data be? I’m pretty sure that capitalization is a way to spot the birth of a silver bullet. Remember that when the next big things, HUGE DATA, then GINORMOUS DATA, are announced at next year’s EDW. I guess then big data will lose its title caps.
Hadoop
Hadoop is one of the many technologies that has come from the Big Data religion…er…movement…no…solutions. The great thing about Hadoop is that everything that makes up Hadoop is named Hadoop. Really. You can’t make this stuff up.
- http://hadoop.apache.org/common Hadoop Common: The common utilities that support the other Hadoop subprojects.
- http://hadoop.apache.org/hdfs Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- http://hadoop.apache.org/mapreduce Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.(2)
What do you think the mascot is named? Yep. You got it. Not Harvey or Harry, but Hadoop. Isn’t it just like the new crowd to not worry about giving everything its own distinctive name? Eventually everything becomes consistent.
Okay, not everything. Other Hadoop-related projects at Apache include:
- http://avro.apache.org Avro™: A data serialization system.
- http://cassandra.apache.org Cassandra™: A scalable multi-master database with no single points of failure.
- http://incubator.apache.org/chukwa Chukwa™: A data collection system for managing large distributed systems.
- http://hbase.apache.org HBase™: A scalable, distributed database that supports structured data storage for large tables.
- http://hive.apache.org Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- http://mahout.apache.org Mahout™: A scalable machine learning and data mining library. (A mahout is an elephant driver.)
- http://pig.apache.org Pig™: A high-level data-flow language and execution framework for parallel computation.
- http://zookeeper.apache.org ZooKeeper™: A high-performance coordination service for distributed applications.
Remember when technologies had names we could say in front of business people without making them think we were idiots? vCOBOL. BASIC. SQLServer. Try saying your project is late because your elephant driver needs to be tuned to work with your pig and ZooKeeper. I want to watch.
Schemaless
One of the great things about Big Data is that usually we don’t know ahead of time what data we are going to get or what questions we need to answer. Yes, Big Data often means a design that is just a big heap of THINGS related to THINGS. Makes data modeling easy. Sort of. Not really. See, the problem with this is that the schema or the design is embedded in with the “real” data. So you can add new data in an instant. Often just as the data arrives. Get ready to sprint your data designs. And by sprint, I mean model at the speed of light. I hope you are in training now. Also, be prepared to see autocorrect change schemaless to many different words than the one you meant. Go ahead. But try it at home, not at work.
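For the curious, here is a minimal sketch of what "the schema is embedded in with the data" can look like in practice. The feed contents and the `observed_schema` helper are made up for illustration; no particular NoSQL product works exactly this way.

```python
import json

# Hypothetical feed: every record carries its own field names,
# so the "schema" travels with the data instead of living in a catalog.
feed = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "name": "Grace", "twitter": "@example"}',
    '{"id": 3, "favourite_drink": "martini"}',
]

records = [json.loads(line) for line in feed]


def observed_schema(records):
    """Collect every field name seen so far, discovered only by reading the data."""
    fields = set()
    for record in records:
        fields.update(record.keys())
    return fields


schema = sorted(observed_schema(records))
print(schema)
```

Note that the third record arrives without a name and with a brand-new field, and nothing stops it. That is the flexibility, and also the data modeling problem.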
Eventual Consistency
I used this term previously. It’s okay, I’m being consistent. Unlike most Big Data technologies. See, there’s this concept of Eventual Consistency that says that data has controlled duplication across nodes. Well, sorta controlled. See, in the Big Data world, it’s okay that the results of the query you run can produce different values than when I run it. Eventually, at some point, we will get the same result. Just like a broken clock is right twice a day. Except in Europe.
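A toy sketch of the idea, assuming two replicas and a simple "highest version wins" rule. The `Replica` class and `sync` function are hypothetical, and real systems use far more elaborate replication and conflict handling.

```python
class Replica:
    """A toy node holding its own copy of the data (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.version = {}  # key -> version number of the stored value

    def write(self, key, value, version):
        # Assumed rule: keep a value only if it is newer than what we hold.
        if version > self.version.get(key, 0):
            self.data[key] = value
            self.version[key] = version

    def read(self, key):
        return self.data.get(key)


def sync(source, target):
    """Copy everything the source knows over to the target."""
    for key, value in source.data.items():
        target.write(key, value, source.version[key])


node_a = Replica("a")
node_b = Replica("b")

node_a.write("balance", 100, version=1)
sync(node_a, node_b)                      # both nodes now agree on 100

node_a.write("balance", 40, version=2)    # the update lands on node_a only
before = (node_a.read("balance"), node_b.read("balance"))

sync(node_a, node_b)                      # "eventually" arrives
after = (node_a.read("balance"), node_b.read("balance"))

print(before, after)
```

Between the second write and the second sync, your query and mine really do return different balances, depending on which node answers.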
I don’t know about you, but I want to know that my version of my bank account balance is the same one that the bank is using to process my cheques.
And don’t even get me started on the people who say that “Eventually the customer will call and ask us to correct the data if it is important to him.” Seriously? What world does this guy live in? Talk about living in the clouds.
All I can say is Consistent my ASCII.
Finally…
I’ve been snarky here. But there really isn’t a reason to think that Big Data, NoSQL and the like are competitors of traditional database technologies. We need to be using the right tools for the right job. Size Doesn’t Matter.
Schemaless is perfect for designing data solutions where you don’t know or really care about perfect data integrity. Think about getting a data feed from an external source where you have no control of what they send you. That flexibility works.
Eventual consistency is just fine for many applications. Who really cares whether everyone sees your Facebook update at the same time? Or whether your iTunes receipt shows up hours later? They don’t do exchanges or refunds anyway.
I suggest you read up on Big Data, attend some talks like the ones here, find out what applications are using Hadoop and other non-relational technologies. They will need data experts and you want to be ready when they need us.
And they involve data. Love your data by using the right solutions.
(1) Wikipedia contributors. “Big data.” Wikipedia, The Free Encyclopedia. 1 May 2012. Web. 1 May 2012.
(2) “Welcome to Apache Hadoop.” hadoop.apache.org. 1 May 2012. Web. 1 May 2012.


















Your excellent and entertaining rant brings to mind a serious question (sorry to switch moods).
At the EDW “Big Panel on Big Data,” Neil Raden made the salient point that having lots more data doesn’t necessarily lead you to better decisions. The trend now is to capture and save every mouse click, every web page visited, every inquiry made, every preference expressed or implied, every connection to other people… every EVERYthing. My question is this: has there been any study showing that storing and analyzing everything produces demonstrably better decisions or results than the statistical sampling methods we’ve used and refined over the last century or more?
Great question. In theory, I’d bet there is a huge set of data (ironically) about accuracy and reliability of large data sets versus accurate samples in research.
I agree in principle with what Neil said…bigger isn’t always better. But I do believe there is value to be gained, depending on the problem being analyzed, to having more data. As always, it depends.
More data as in a larger sample? Or ALL available data? When is it too much?
I know of one area where large volumes of data collection over a long period of time have led to better predictive models, and that is (hold your snarky comments to yourself) weather forecasting. Most of the rest of science has gotten by on sampling just because of the sheer impracticality of trying to measure everything. It is kind of like buying a larger house and then going out and buying more stuff (a la George Carlin) to fill it up. Just because data storage has become so relatively cheap, do we really need to fill it up with all this stuff?