Size Doesn’t Matter. Or Does It? A Rant on Big Data Terms

By Karen Lopez

@datachick

Last week at Enterprise Data World, I gave a lightning talk (a strictly enforced, time-limited, five-minute presentation). Since I can barely manage to fit meaningful presentations in one hour given how much I talk, I decided to go for a rant. That may not come as a surprise to you. If it does, welcome, new reader.

Last year I also gave a lightning talk on the Myths of Normalization, which I’ve turned into a blog series here, too.  That was also a rant.  I’m seeing a pattern here.  You should probably envision me holding a martini and my tablet while I ranted.  It will put you in the mood. I didn’t actually hold a martini in my hand.  It was on the table beside me.

Since I got good feedback, I decided to share my script for my talk here. By the way, those of you snickering about this rant, know that I’m working on a similar one for the RDBMS area as we speak.  Look for it at a NoSQL or Big Data event near you.

Size Doesn’t Matter.  Or Does It?

I’m @Datachick. I think a lot about data.  Today I’m on a rant.  I know.  SHOCKING!!!

I’m a huge fan of Big Data and NoSQL. Really. A really, really big fan. Get it? BIG DATA. Today I want to share with you some of my more snarky observations about BIG DATA. By the way, every single one of these rants is totally unfair, cherry-picked, and irreverent. I know. It’s shocking.

Let’s start with the basics: What is Big Data?  I’m here to tell you that nobody really knows.  The good thing about Big Data is just that.  So it can be anything you want it to be. Really.  Just like that nice friendly woman who wanted you to buy her a drink last night.

Here’s a nice definition from Wikipedia, the ultimate source of knowledge for the human race.  But that’s a whole ‘nother rant.

In information technology, big data consists of data sets that grow so large that they become awkward to work with(1)

What the heck kinda definition is that? Data that’s so big it’s awkward? I can’t wait to be in that meeting with the CEO, CIO, and friends.

Big Data vs. data

One of the things I noticed about Big Data is that it is always capitalized when it’s written. I’m not sure why, because it really isn’t a proper noun. We don’t capitalize DATA, so why should Big Data be? I’m pretty sure that capitalization is a way to spot the birth of a silver bullet. Remember that when the next big thing, HUGE DATA, and then GINORMOUS DATA, is announced at next year’s EDW. I guess then big data will lose its title caps.

Hadoop

Hadoop is one of the many technologies that have come from the Big Data religion…er…movement…no…solutions. The great thing about Hadoop is that everything that makes up Hadoop is named Hadoop. Really. You can’t make this stuff up.

What do you think the mascot is named?  Yep. You got it.  Not Harvey or Harry, but Hadoop.  Isn’t it just like the new crowd to not worry about giving everything its own distinctive name?  Eventually everything becomes consistent.

Okay, not everything. Other Hadoop-related projects at Apache include Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, and ZooKeeper.(2)

Remember when technologies had names we could say in front of business people without making them think we were idiots? COBOL. BASIC. SQL Server. Try saying your project is late because your elephant driver needs to be tuned to work with your Pig and ZooKeeper. I want to watch.

Schemaless

One of the great things about Big Data is that usually we don’t know ahead of time what data we are going to get or what questions we need to answer. Yes, Big Data often means a design that is just a big heap of THINGS related to THINGS. Makes data modeling easy. Sort of. Not really. See, the problem with this is that the schema, the design, is embedded with the “real” data. So you can add new data in an instant, often just as the data arrives. Get ready to sprint your data designs. And by sprint, I mean model at the speed of light. I hope you are in training now. Also, be prepared to see autocorrect change schemaless to many different words than the one you meant. Go ahead. But try it at home, not at work.
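For the non-believers, here’s what that looks like in practice. A minimal sketch, in plain Python standing in for a document store (all the field names are made up): every document carries its own structure, so the design work doesn’t disappear. It just moves into every single query.

```python
# In a schemaless store, each document carries its own structure.
# These are all "customers," but no schema forces them to agree.
customers = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "e-mail": "bob@example.com", "phone": "555-0100"},
    {"full_name": "Carol", "emails": ["carol@example.com"]},
]

def get_email(doc):
    """Schema-on-read: the design work happens again in every query."""
    if "email" in doc:
        return doc["email"]
    if "e-mail" in doc:        # someone renamed the field on the fly
        return doc["e-mail"]
    if "emails" in doc:        # ...and someone else made it a list
        return doc["emails"][0]
    return None

for doc in customers:
    print(doc.get("name") or doc.get("full_name"), "->", get_email(doc))
```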

Eventual Consistency

I used this term previously. It’s okay, I’m being consistent. Unlike most Big Data technologies. See, there’s this concept of Eventual Consistency that says that data has controlled duplication across nodes. Well, sorta controlled. See, in the Big Data world, it’s okay that the query you run can produce different values than the same query when I run it. Eventually, at some point, we will get the same result. Just like a broken clock is right twice a day. Except in Europe.
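If you want to watch the broken clock tick, here’s a minimal sketch (plain Python, hypothetical node names, not any particular product’s replication protocol) of two replicas with asynchronous replication: my read and your read disagree until the write eventually propagates.

```python
import time

# Two replicas of the same record. Replication is asynchronous, so a
# write to one node takes a while to show up on the other.
replicas = {"node_a": {"balance": 100}, "node_b": {"balance": 100}}
pending = []  # replication queue: (apply_at, target_node, key, value)

def write(node, key, value, lag_seconds=2.0):
    """Write to one replica; ship the change to the other node later."""
    replicas[node][key] = value
    other = "node_b" if node == "node_a" else "node_a"
    pending.append((time.time() + lag_seconds, other, key, value))

def read(node):
    """Apply any replication that has 'arrived', then read this replica."""
    now = time.time()
    for item in [p for p in pending if p[0] <= now]:
        _, target, key, value = item
        replicas[target][key] = value
        pending.remove(item)
    return replicas[node]

write("node_a", "balance", 250)          # deposit lands on node_a
print(read("node_a"))   # {'balance': 250} -- my query
print(read("node_b"))   # {'balance': 100} -- your query, stale
time.sleep(2.1)
print(read("node_b"))   # {'balance': 250} -- eventually consistent
```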

I don’t know about you, but I want to know that my version of my bank account balance is the same one that the bank is using to process my cheques.

And don’t even get me started on the people who say that “Eventually the customer will call and ask us to correct the data if it is important to him.”  Seriously?  What world does this guy live in? Talk about living in the clouds.

All I can say is Consistent my ASCII.

Finally…

I’ve been snarky here. But there really isn’t a reason to think that Big Data, NoSQL, and the like are competitors of traditional database technologies. We need to use the right tool for the right job. Size Doesn’t Matter.

Schemaless is perfect for designing data solutions where you don’t know the data ahead of time or don’t really need perfect data integrity. Think about getting a data feed from an external source where you have no control over what they send you. That flexibility works.
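Here’s a minimal sketch of that kind of tolerant feed handler (plain Python, hypothetical field names): accept whatever shows up, stamp it, and sort out what it means later.

```python
import json
from datetime import datetime, timezone

def ingest(raw_line, store):
    """Accept whatever the external feed sends; never reject on shape."""
    try:
        doc = json.loads(raw_line)
    except json.JSONDecodeError:
        doc = {"unparsed": raw_line}      # keep even the garbage for later
    doc["_received_at"] = datetime.now(timezone.utc).isoformat()
    store.append(doc)                     # no schema to violate

feed = [
    '{"sku": 1, "price": 9.99}',
    '{"sku": 2, "cost": "9,99", "currency": "EUR"}',
    "not even JSON",
]
store = []
for line in feed:
    ingest(line, store)
print(len(store), "records kept, 0 rejected")
```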

Eventual consistency is just fine for many applications. Who really cares whether everyone sees your Facebook update at the same time? Or whether your iTunes receipt shows up hours later? They don’t do exchanges or refunds anyway.

I suggest you read up on Big Data, attend some talks like the ones here, and find out what applications are using Hadoop and other non-relational technologies. They will need data experts, and you want to be ready when they need us.

And they involve data.  Love your data by using the right solutions.

(1) Wikipedia contributors. “Big data.” Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 1 May 2012. Web. 1 May 2012.

(2) “Welcome to Apache Hadoop.” hadoop.apache.org, 1 May 2012. Web. 1 May 2012.

photo by: Nesster

Karen Lopez

Karen Lopez is Sr. Project Manager and Architect at InfoAdvisors. She has 20+ years of experience in project and data management on large, multi-project programs. Karen specializes in the practical application of data management principles. She is a frequent speaker, blogger and panelist on data quality, data governance, logical and physical modeling, data compliance, development methodologies and social issues in computing. Karen is an active user on social media and has been named one of the top 3 technology influencers by IBM Canada and one of the top 17 women in information management by Information Management Magazine. She is a Microsoft SQL Server MVP, specializing in data modeling and database design. She’s an advisor to the DAMA International Board and a member of the Advisory Board of Zachman International. She’s known for her slightly irreverent yet constructive opinions and rants on information technology topics. She wants you to love your data. Karen is also moderator of the InfoAdvisors Discussion Groups at www.infoadvisors.com and dm-discuss on Yahoo Groups. Follow Karen on Twitter (@datachick).


  7 comments for “Size Doesn’t Matter. Or Does It? A Rant on Big Data Terms”

  1. John Biderman
    May 21, 2012 at 10:36 am

    Your excellent and entertaining rant brings to mind a serious question (sorry to switch moods).

    At the EDW “Big Panel on Big Data,” Neil Raden made the salient point that having lots more data doesn’t necessarily lead you to better decisions. The trend now is to capture and save every mouse click, every web page visited, every inquiry made, every preference expressed or implied, every connection to other people… every EVERYthing. My question is this: has there been any study showing that storing and analyzing everything produces demonstrably better decisions or results than the statistical sampling methods we’ve used and refined over the last century or more?

    • Karen Lopez
      May 21, 2012 at 10:39 am

      Great question. In theory, I’d bet there is a huge set of data (ironically) about accuracy and reliability of large data sets versus accurate samples in research.

      I agree in principle with what Neil said…bigger isn’t always better. But I do believe there is value to be gained, depending on the problem being analyzed, to having more data. As always, it depends.

      • John Biderman
        May 21, 2012 at 10:44 am

        More data as in a larger sample? Or ALL available data? When is it too much?

        I know of one area where large volumes of data collection over a long period of time have led to better predictive models, and that is (hold your snarky comments to yourself) weather forecasting. Most of the rest of science has gotten by on sampling just because of the sheer impracticality of trying to measure everything. It is kind of like buying a larger house and then going out and buying more stuff (a la George Carlin) to fill it up. Just because data storage has become so relatively cheap, do we really need to fill it up with all this stuff?
