
Data De-duplication Should Be the Heart of all Big Data Strategies

By James Kobielus  /  June 23, 2014

Administrators of monolithic data architectures in the olden days had it easy. They didn’t have to worry about keeping tabs on where their data was being stored, since it was all in one place. As the author of this recent article noted, “There was a flat file system in which all data was stored in a single file which was either text or comma separated file with character set defining each and every data. That was the period we don’t have any data structure or data type or storage optimization techniques.”

But that was decades ago. Now, in the era of multi-tiered, hybridized, decentralized big-data infrastructures, most data administrators find themselves playing perpetual whack-a-mole with duplicate data sets. Duplicates of enterprise data are everywhere: in your staging nodes, your data warehouses, your data-science sandboxes, your subject-area analytical data marts, your online archives, and so on.

Data duplication is not necessarily a bad thing. For example, data protection strategies often hinge on having ready online backups of all your key business information. But copies inherently hog your precious storage resources. More than that, out-of-control data duplication makes it more difficult to realize the objectives of your big data initiatives.
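To make the cost of unchecked copying concrete, here is a minimal sketch (mine, not from the article) that flags byte-identical data sets scattered across storage tiers by hashing their contents. The directory paths are hypothetical placeholders for a staging area, a warehouse extract, and a data-science sandbox.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_files(roots: list[str]) -> dict[str, list[Path]]:
    """Group files from several storage locations by identical content."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                groups[content_digest(path)].append(path)
    # Keep only digests that show up in more than one place.
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    # Hypothetical mount points -- substitute your own environment's paths.
    duplicates = find_duplicate_files(["/data/staging", "/data/warehouse", "/data/sandbox"])
    for digest, paths in duplicates.items():
        print(f"{digest[:12]}… appears {len(paths)} times:")
        for p in paths:
            print(f"    {p}")
```

A report like this is only a starting point, but it shows how quickly redundant copies can be surfaced before deciding which ones are deliberate backups and which are dead weight.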

Just a few years ago, the theme of “no-copy analytics” was almost synonymous with that of big data. The concept is straightforward: consolidate more of your data in a single repository – be it Hadoop, an MPP RDBMS-based platform, or some other database platform – and move more of your analytic models, algorithms, and applications to execute natively in that repository.
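As a rough illustration of the “execute natively” idea, the sketch below (my own, under assumed path and column names) uses PySpark to run an aggregation directly against files already sitting in a shared repository, rather than exporting yet another copy to a separate analytics environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to the cluster that already hosts the consolidated data.
spark = SparkSession.builder.appName("no-copy-analytics-sketch").getOrCreate()

# Read the data where it lives (hypothetical path); nothing is extracted
# or copied out to a separate analytics system.
orders = spark.read.parquet("hdfs:///consolidated/orders")

# The aggregation executes on the cluster, next to the data.
revenue_by_region = (
    orders
    .groupBy("region")                            # assumed column
    .agg(F.sum("order_total").alias("revenue"))   # assumed column
    .orderBy(F.desc("revenue"))
)

revenue_by_region.show()
```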

From an operational standpoint, no-copy analytics allows you to address any or all three of the following enterprise big data imperatives:

  1. Optimize the use of your limited storage resources by eliminating unnecessary copies of the data, thereby reducing storage costs.
  2. Develop more powerful analytics, via MapReduce, YARN, R, and other programming frameworks, to the extent that you can execute more applications on this consolidated data resource.
  3. Improve enterprise data quality through tighter governance on the consolidated hub.

You can call this no-copy analytics platform a “data warehouse,” a “data lake,” a “data hub,” a “big-data repository,” or whatever you wish. The actual nomenclature is unimportant.

What is important, if you’re a storage administrator, is the first of these benefits: data consolidation. In other words, you’ll focus on the “no-copy” part of a no-copy analytics strategy and tend to give the analytics component short shrift. This is in fact the perspective of the author of this recent article: “The goal of data consolidation is to reduce cost by eliminating needless copies, while at the same time simplifying data management.”

But if you’re a data scientist, you’ll emphasize the “analytics” applications of no-copy analytics, with the storage-optimization benefits being secondary. Consequently, you’ll focus on consolidation of data from disparate sources onto a single platform only to the extent that it gives you more comprehensive data assets to build and tune your statistical algorithms, models, and applications.

But there’s another important point of view to be considered. If you’re a data stewardship professional, you’ll emphasize the third benefit cited above, rather than the de-duplication and data science applications. To the extent that data is consolidated onto a single corporate master data management hub, that will become the single version of truth from which all downstream applications are served.
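At the record level, that consolidated hub is also where duplicate or conflicting master records get resolved. Here is a minimal sketch, assuming a pandas DataFrame of customer records with hypothetical column names, of normalizing a natural key and collapsing exact duplicates before downstream applications consume the data; real master data management matching is usually fuzzier than this.

```python
import pandas as pd

def dedupe_customer_records(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse duplicate customer rows on a normalized key.

    Assumes hypothetical 'email' and 'updated_at' columns; the most
    recently updated record survives as the "golden" row for each key.
    """
    normalized = df.assign(email=df["email"].str.strip().str.lower())
    return (
        normalized
        .sort_values("updated_at")
        .drop_duplicates(subset="email", keep="last")
        .reset_index(drop=True)
    )

if __name__ == "__main__":
    records = pd.DataFrame({
        "email": ["Pat@Example.com", "pat@example.com ", "lee@example.com"],
        "name": ["Pat", "Patricia", "Lee"],
        "updated_at": pd.to_datetime(["2014-01-05", "2014-06-01", "2014-03-10"]),
    })
    print(dedupe_customer_records(records))
```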

So, in the final analysis, you can’t field a viable big data strategy unless you’re serious about de-duplication. This should be a high shared priority for all data analytics professionals.

About the author

James Kobielus, Wikibon, Lead Analyst

Jim is Wikibon's Lead Analyst for Data Science, Deep Learning, and Application Development. Previously, Jim was IBM's data science evangelist. He managed IBM's thought leadership, social and influencer marketing programs targeted at developers of big data analytics, machine learning, and cognitive computing applications. Prior to his 5-year stint at IBM, Jim was an analyst at Forrester Research, Current Analysis, and the Burton Group. He is also a prolific blogger, a popular speaker, and a familiar face from his many appearances as an expert on theCUBE and at industry events.
