In Defense of the Data Swamp

By Kimberly Nevala

Like political discourse in the US, the issue of the Data Lake has been decidedly polarized. Two sides advocate for better data access and faster insight, each taking a drastically different approach to achieve the goal. To be fair: moderates exist and are – in many cases – successfully navigating their Data Lakes quietly and without fanfare. But the public debate, like the debates raised by many a client, has often centered on two distinctly different approaches.

The first? Pristinely clean and open for public swimming: all abilities welcome. In this model, the technologies underpinning the lake (Hadoop and its brethren) supported non-relational, non-structured information sources. The Data Management practices employed, however, were strictly old-school: rigid controls on onboarding, a focus on upfront modeling and “clean” data/Data Quality. In other words, a Data Warehouse in Big Data’s clothing.

The other? No lifeguard on duty. Riptides may be present. Swim at your own risk. This model also embraced “Big Data” platforms, but eschewed everything analysts found frustrating about the Data Warehouse: the rigorous structure, the tightly governed quality requirements, and the constraints on what could be included, compounded by the restricted flow of new content coming in.

Recently, however – as evidenced by the latest spate of articles heralding the demise of the Data Lake – both sides have found common ground: users are drowning in the Data Lake. Assuming they weren’t too intimidated to dip a toe in the water in the first place.

The problem, perhaps, was the tendency to want to keep things simple and uniform. Uniformly governed. Uniformly ungoverned. Uniformly structured. Uniformly unstructured. Uniformly accessible. Uniformly uniform.

In that quest, perhaps we were too quick to disparage the Data Swamp.

No, a swamp isn’t as visually appealing as a lake. A swamp is also often confused with a toxic dump. But despite its outward appearance, a healthy swamp is a rich, diverse, sometimes messy ecosystem. One in which a profusion of plants and animals thrive in happy symbiosis or grudging respect.

Isn’t That What a Modern Data Ecosystem Needs to Be?

Forget catering to only one (analytical) lifeform or expecting everyone to conform to a single pattern. The modern data ecosystem:

  • Recognizes the value, distinct wants and needs of each analytical lifeform: from advanced Data Scientists to operational report writers.
  • Understands that the health of the inhabitants is only as good as the health of their food chain.
  • Enforces rigor to ensure that new material added to the chain serves an intended purpose.
  • Knows which elements are symbiotic and which aren’t.
  • Clearly posts descriptions of the terrain, its intended use and risk: deep waters ahead or public beach – lifeguard on duty.
  • Installs buoys, markers and, when necessary, boardwalks to help folks navigate to the area appropriate for them.
  • Holds each entity accountable for being prepared: thrashing through a swamp unprepared favors the alligators.
  • Clears out unused debris on an ongoing basis: no moldering toxins here!

How healthy is your Data Lake? If the answer is “not well,” perhaps it’s time to give the Data Swamp another chance.
