Data Lakes: Cleaning Up Data’s Junk Drawer

By Paul Brunet.

We all have that place where we end up stashing those things we think we’ll need or want someday. Some of us throw the stuff in a junk drawer in the kitchen. Others squirrel it away to the attic or into a closet in the spare bedroom.

On occasion, we do venture into these storage spaces and unearth certain items that prove to be extremely beneficial in solving a problem. But in most instances, those things we deemed essential in the moment are left in a jam-packed drawer or dark corner of a closet — forgotten and worthless, yet taking up valuable space that could be utilized in some other way.

This is precisely the situation many organizations face today with their data.

A Junk Drawer Full of Data

Today, the amount of data produced by businesses continues to increase at a dizzying speed. Many organizations migrate their data into a Data Lake, thanks to its inherent scalability and flexibility. What goes in the lake stays in the lake. On the surface, this appears to be a smart business move, since data is their most valuable asset.

But dive beneath the surface and you’ll discover that using a Data Lake as a repository without giving consideration to its usage makes it no better than a junk drawer. Sure, the lake may store a vast amount of data, but all of the raw data in the world is of little worth if there isn’t a process in place for unlocking its value. Even worse, there may be private information in that unopened letter that you don’t want others to see.

The vast majority of businesses have Data Lakes that are little more than virtual junk drawers: reservoirs that house data from disparate sources across the enterprise. The problem is that most of this data is never accessed. In fact, it’s not uncommon for most users to find only a small percentage of truly valuable data sets. The rest is submerged in the lake, an uncataloged, useless jumble of data sets taking up costly space without providing the ROI businesses expect. Users don’t know how to find data sets in the lake. If they can find them, it’s difficult and time-consuming to work out which ones are best, or whether they should have access to them at all.

Cleaning Out Your Junk Drawer

Data Lakes that are ungoverned, disorderly bodies will never live up to their potential as first-class business assets, and they remain a lingering privacy liability. When organizations fail to put good governance principles in place for their lakes, users can’t find what they’re looking for, or they don’t know what they’re looking at. In some instances, users may be able to locate data in a lake but misunderstand its purpose. Unaware, these users derive insights from the data that might run counter to the business’s legal or strategic policies. Worse, one team may interpret the data one way while another team defines it in an entirely different way.

What can you do if your Data Lake has become a junk drawer? And more important, how do you prevent your lake from becoming a junk drawer?

Interestingly, the answer to both questions is the same.

Some believe that purchasing and implementing Machine Learning technology is enough, and that it will automatically detect and decipher the data sets in your lake. But experience shows that technology alone cannot clean up a Data Lake or keep a lake from becoming a swamp. Rather, using technology to pinpoint the appropriate subject matter experts (SMEs) in your organization who are familiar with the data will get you much further, much faster.

The Clean-Up Process

SMEs must determine what data should, and shouldn’t, be housed in the lake. Once you’ve identified the appropriate SMEs, they can act as the stewards of data governance, identifying and cataloging valid data. They can also help prioritize the most important assets that are directly aligned to the business priorities.

A taxonomy system that uses categories and subcategories to classify and catalog your data (like the Yellow Pages) makes data readily available to users; the specifics about what your categories are and how many you have will be unique to your organization. On an ongoing basis, data sets throughout the enterprise will be cataloged, tagged appropriately, and placed in the lake.
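To make the Yellow Pages idea concrete, here is a minimal sketch of a catalog indexed by category and subcategory. Everything in it is an assumption made for illustration: the `DataSetEntry` fields, the `Catalog` class, and the sample categories are hypothetical, and a real lake would rely on a dedicated catalog tool rather than an in-memory dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSetEntry:
    """One cataloged data set (all fields are illustrative assumptions)."""
    name: str
    category: str
    subcategory: str
    steward: str  # the SME responsible for this data set
    tags: set[str] = field(default_factory=set)


class Catalog:
    """A tiny in-memory catalog indexed by (category, subcategory)."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str], list[DataSetEntry]] = defaultdict(list)

    def register(self, entry: DataSetEntry) -> None:
        """File the entry under its category and subcategory."""
        self._index[(entry.category, entry.subcategory)].append(entry)

    def browse(self, category: str, subcategory: Optional[str] = None) -> list[DataSetEntry]:
        """Return entries in a category, optionally narrowed to one subcategory."""
        return [
            entry
            for (cat, sub), entries in self._index.items()
            if cat == category and (subcategory is None or sub == subcategory)
            for entry in entries
        ]


# Hypothetical sample data: names, categories, and stewards are made up.
catalog = Catalog()
catalog.register(DataSetEntry("orders_2023", "Sales", "Orders", "alice", {"certified"}))
catalog.register(DataSetEntry("leads_raw", "Sales", "Leads", "bob"))
catalog.register(DataSetEntry("payroll", "HR", "Compensation", "carol", {"private"}))

sales_names = [entry.name for entry in catalog.browse("Sales")]
```

A user looking for sales data browses one category instead of trawling the whole lake, and every entry names the steward to ask about it.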

Data that is deemed superfluous or private can be archived, deleted or anonymized. (Yes, you read that right. Data that’s irrelevant can and should be deleted.) This frees up space in a lake for data that is deemed a true and lasting business asset.
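One way to picture this triage is as a small decision rule applied to each SME review. The sketch below is only an assumption about how such a rule might look: the `Review` fields, the `must_retain` flag, and the order of the checks are hypothetical, and your own legal and business policies would set the real rules.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    KEEP = "keep"
    ARCHIVE = "archive"
    DELETE = "delete"
    ANONYMIZE = "anonymize"


@dataclass
class Review:
    """An SME's verdict on one data set (hypothetical fields)."""
    name: str
    relevant: bool          # still tied to a business priority?
    contains_private: bool  # holds personal or otherwise private data?
    must_retain: bool       # assumed flag: a retention obligation applies


def decide(review: Review) -> Disposition:
    """Map a review to a disposition using illustrative, assumed rules."""
    # Useful but private data stays in the lake only in anonymized form.
    if review.relevant and review.contains_private:
        return Disposition.ANONYMIZE
    if review.relevant:
        return Disposition.KEEP
    # Irrelevant data is archived if it must be retained, otherwise deleted.
    return Disposition.ARCHIVE if review.must_retain else Disposition.DELETE


# Hypothetical reviews for illustration.
reviews = [
    Review("customer_profiles", relevant=True, contains_private=True, must_retain=False),
    Review("orders_2023", relevant=True, contains_private=False, must_retain=False),
    Review("old_invoices", relevant=False, contains_private=False, must_retain=True),
    Review("tmp_export_17", relevant=False, contains_private=False, must_retain=False),
]
plan = {review.name: decide(review) for review in reviews}
```

Keeping the rule in one small function makes the policy easy for SMEs to read, question, and change.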

Collaboration Clears the Way

With all the clutter cleared away, SMEs can work together with users to further refine the data in the lake. Some categories will have an excess of data sets that overlap, while others will be empty or contain an insufficient amount of data. As more users opt for certain data sets over others, that data is certified thanks to the wisdom of the crowd and becomes even more valuable to a company. Users can utilize a workflow to locate the highest-quality data for their needs.
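The wisdom-of-the-crowd idea can be sketched as a simple usage count: a data set is flagged as a certification candidate once enough distinct users have chosen it. The access log format, the `certify_by_usage` function, and the `min_users` threshold below are all assumptions made for illustration. In practice, SMEs would still review the candidates before certifying anything.

```python
from collections.abc import Iterable


def certify_by_usage(access_log: Iterable[tuple[str, str]], min_users: int = 3) -> list[str]:
    """Return, sorted by name, the data sets used by at least `min_users` distinct users.

    `access_log` is an assumed format: (user, data_set) pairs, one per access.
    """
    users_per_set: dict[str, set[str]] = {}
    for user, data_set in access_log:
        # A set ignores repeat visits, so one heavy user cannot certify alone.
        users_per_set.setdefault(data_set, set()).add(user)
    return sorted(
        name for name, users in users_per_set.items() if len(users) >= min_users
    )


# Hypothetical access log for illustration.
log = [
    ("ann", "orders_2023"),
    ("bob", "orders_2023"),
    ("cy", "orders_2023"),
    ("ann", "orders_2023"),  # repeat access by the same user
    ("ann", "leads_raw"),
    ("bob", "leads_raw"),
]
candidates = certify_by_usage(log)
```

Counting distinct users rather than raw accesses is a deliberate choice here: it rewards broad adoption instead of a single team's habits.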

Data alone won’t save us. Neither will a one-size-fits-all storage solution nor the implementation of Machine Learning technology. But with smart collaborative work between subject matter experts and data governance technology, the answer every user is looking for is right in plain sight: in the Data Lake.
