
How Much Big Data Do You Actually Need?

March 13, 2013

by James Kobielus

The point of big data is to extract deeper intelligence from data while eliminating the scale constraints that have frustrated traditional business analytics initiatives. Fundamentally, big data is all about matching the data analytics platform scale to the business challenges you’re trying to address. It’s not about treating one massive scale as the panacea for all projects.

In fact, brilliant business insights can and often do emerge from “small data.” Most business intelligence (BI) initiatives serve their core functions quite well in “small data” territory: low terabytes, batch processing, and structured data sources. As a general rule, you should start and stay simple on your BI strategies and deployments unless you have a compelling reason to build a more complex BI system. Most BI is just focused on delivering basic reports, and you may not need fancy dashboards, predictive models or continuous data updates. That’s because you may have just one or two data sources, no fancy statistical modeling requirements, and only a few users doing ad-hoc query and batch reporting for decision support.

But there is a growing range of business analytics requirements that absolutely need big data. The hardcore applications for big data are those that deliver business results most effectively at the more extreme data volumes, velocities, and varieties. I’d like to propose several categories of use cases that are in big data’s sweet spot, and for which “small data” approaches, such as traditional BI, are not well suited:

  • Whole-population analytics: This refers to any application that requires interactive access to the entire population of analytical data, rather than just to convenience samples, subsets, or slices.
  • Microsegmentation analytics: This refers to any application requiring fine-grained segmentation of entities described in the underlying data sets.
  • Behavioral analytics: This refers to any application requiring deep data on the behavior of entities (e.g., humans, groups, system components) and the relationships among them.
  • Unstructured analytics: This refers to any application that analyzes a deep store of data sourced from enterprise content management systems, social media, text, blogs, log data, sensor data, event data, RFID data, imaging, video, speech, geospatial, and more.
  • Multistructured analytics: This refers to any application that requires unified discovery, acquisition, storage, management, and analysis of all data types, ranging from structured to unstructured.
  • Temporal analytics: This refers to any application that requires a converged view across one or more time-horizons: historical, current, and predictive.
  • Multivariate analytics: This refers to any application that requires detailed, interactive, multidimensional statistical analysis and correlation. This requires a big data platform that can execute these models in a massively parallel manner.
  • Multi-scenario analytics: This refers to any application requiring you to model and simulate alternate scenarios, engage in free-form what-if analysis, and forecast alternative future states. This requires a big data platform that supports fluid exploration without needing to define data models up front.
  • Sensor analytics: This refers to any application that requires automated sensors to take measurements and feed them back to centralized points at which the data are aggregated, correlated, and analyzed.

These are the sorts of applications for which various big-data platforms–such as enterprise data warehouses (EDWs), Hadoop, NoSQL, and in-memory databases–may be best suited, either individually or in tandem. All of these approaches give you the headroom to scale out massively along any or all of the core “Vs” (volume, velocity, variety) as your analytical needs grow.
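
To make the whole-population and microsegmentation categories in the list above a bit more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from a real platform: the synthetic customer records, the `make_population` and `segment_counts` helpers, and the choice of region and age band as segmentation keys. It simply shows how a convenience sample can hide fine-grained segments that are visible when you analyze the entire population.

```python
from collections import Counter

# Hypothetical segmentation keys, chosen only for illustration.
REGIONS = ["north", "south", "east", "west"]
AGE_BANDS = ["18-29", "30-44", "45-64", "65+"]


def make_population(n):
    """Build n synthetic (customer_id, region, age_band) records.

    The assignment is deterministic so the example is reproducible.
    """
    return [
        (i, REGIONS[i % 4], AGE_BANDS[(i // 4) % 4])
        for i in range(n)
    ]


def segment_counts(records):
    """Count records in each (region, age_band) microsegment."""
    return Counter((region, age) for _, region, age in records)


population = make_population(1000)

# A convenience sample: just the first rows that happened to be loaded.
convenience_sample = population[:10]

full = segment_counts(population)
sampled = segment_counts(convenience_sample)

print(len(full), "microsegments found in the whole population")
print(len(sampled), "microsegments visible in the convenience sample")
print("missed by the sample:", sorted(set(full) - set(sampled)))
```

At this toy scale the whole population fits easily in memory. The point of the sketch is only the gap between the two counts, which is the kind of gap that pushes real applications with many more records and segmentation keys toward a platform that can scan everything.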

The popular focus on the extreme scale of big data distracts from the fact that many such initiatives had to start somewhere. That “somewhere”–the big data on-ramp and nucleus–is usually in “small data” territory. More often than not, the big data nucleus will be your existing data analytic infrastructure: the data mart, data mining, BI, and so forth. Even now, your data scientists may be playing with Hadoop clusters in the low terabytes, confident that there are no impediments to elastic scaling into the petabytes when their MapReduce, machine learning, content analytics, and other models must be put into full production.

As your needs evolve in all of these areas, you will some day need to scale into big data territory. You may not require massive capacity immediately, but you know you will some day soon, and it’s good to have a big-data platform that can scale out without impediment.

About the author

James Kobielus, Wikibon, Lead Analyst

Jim is Wikibon's Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM's data science evangelist. He managed IBM's thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.
