Mind the Gap: Data Rabbits

Show of hands: How many of you have ever heard corporate leadership say something like, “We are moving everything into the cloud and closing down our data center so we can scale faster and save money”? Maybe you heard the slightly constrained version, “We are moving all of our analytics into the cloud and getting rid of our on-premises data warehouse so we can scale faster and save money.” How’d that work out?

Let’s focus right now on data and analytics, but the same ideas apply to other areas. It’s so very alluring, especially at first. After all, moving everything to the cloud appears to resolve the two biggest complaints that the business has about IT: speed and cost. Too little of the first and too much of the second. The cloud promised to fix it all, at least according to the 8×10 glossies: self-service, unlimited space, and lower cost. And empowerment!

It usually starts this way:

In the cloud we don’t have to wait for the data team. Their backlog is overflowing and it’ll be at least six months before they get to my project. I need answers now. Besides, how do I even know that their data will solve my problem? And don’t get me started with the governance committee and their ivory tower processes. Supposedly I got assigned to be a data sewer or steward or something.

Anyway, I have a couple people who know a couple things. We’ll just pull the data and build it ourselves. Call the cloud platform salesperson that met with the team a couple months ago. He said that we could build it faster and for less than the data team had quoted. Just activate some tools, point-and-click, and we’ve got a data pipeline, repository, access tools, and numbers we can trust (because they’re our numbers). And storage is practically free.

It always starts out free. Free like a puppy.

With apologies to Rush, choosing to not manage data is itself a strategic choice. It’s understandable. It’s self-reinforcing. And for a while the strategy appears to be successful. Results are delivered faster. Chicken lunches all around! The siren song is heard by leadership in other areas and they want to complete their analyses and deploy their applications quickly, too. So, they imitate the pattern, creating their own data feeds and their own repositories. Now I have numbers I trust and I got them fast. More chicken lunches!

Before long, the analytical environment starts to receive industry kudos. Everybody wants to be the center of attention at the Executive Leadership Summit. Who wants to be bothered with a measly terabyte-sized analytical environment? Our company’s is approaching an exabyte. Do you even know what an exabyte is?

The result is data multiplying uncontrollably like … you know.

And then one day, the numbers shared by two organization heads in an executive meeting don’t match. They haven’t been matching for a while, but at least they were pretty close. Close enough. Now, they’re not so close. A meeting that had been called to address a business issue is now consumed with arguing over whose numbers are correct. A tiger team is created to identify the source of the differences and reconcile the numbers.

It isn’t long before it happens again. And again. We’re going to run out of tiger teams.

But the problems don’t end there. Your one data feed appeared to be less expensive to deploy. And his. And hers. And theirs. Over time, the company ends up with duplicate pipelines processing and storing the same data. Each little repository brings with it its own ingestion logic, transformations, storage space, and maintenance burden. And then these little repositories become data sources with their own downstream dependencies. Nobody who’s deployed a private repository because it’s cheaper and easier did it anticipating that they would be supporting anybody else.

The enterprise loses its economies of scale.

Remember those mismatched report numbers and the tiger teams tasked with reconciling them? Maybe there are some people who relish the opportunity to do data reconciliation projects, but I’m not sure I know many, if any. Lineage is hard to trace, and debugging is more like archeology.

What started out as a way to reduce time and cost is now failing on both fronts.

By the time you realize that you have been lured down a bad path, you’ve got a huge mess on your hands. Some companies will doggedly cling to this approach, even convincing themselves that it is the preferred approach. I believe that they do recognize, or eventually recognize, that they are fooling themselves. They continue in that direction anyway, not because that’s what they want to do, but because they find changing course too daunting. Well, yes. The longer you keep making a mess, the longer it’s going to take to clean up. Entropy only moves in one direction and it takes effort to reverse it.

There is something between overburdensome governance and data anarchy.

Unrestrained replication may also be a byproduct of technology selection. File-based repositories, especially the raw object stores used in most data lakes, struggle with WHERE clauses and joins. Therefore, teams store context-specific result sets instead of querying shared data. You end up with files for each report, dashboard, data mart, and application. And why not? Cloud processing is expensive, disk space is practically free, and besides, my results are already there when I need them. We’ve already seen why not.

So, what can we do about it?

In many cases, the answer will appear to be “nothing,” or at least very little. We might see the challenges that accompany uncontrolled replication, but the inevitable early successes will very visibly contradict our warnings.

This leaves us with two choices. The first is to salute and execute. Let the pieces fall where they may. Of course, we’ll have to clean up the mess later. Don’t expect credit for having foreseen the problem. You may have been correct, but at most companies it’s career-limiting to remind management of the fact.

The second is to:

Look for opportunities to influence architecture and process in ways that limit the damage, even if it is only around the edges for the time being.

Here are a few you can start with.

1. Sandbox Management

A mature enterprise analytics architecture will include independent repositories created as sandboxes for exploring new use cases and prototyping. This is a good thing. Accommodate them. Plan for them. Encourage them. But be sure to manage them and to isolate them.

Do not allow sandboxes to become permanent and do not allow sandbox content to become public.

The processes to request and create sandboxes should be as fast and as frictionless as possible. The same goes for promotion to production, although there proper curation is required. (If you’ve got ongoing uncontrolled replication, you probably aren’t pursuing a data product strategy, so curation, at minimum, must include definitions, expected content, authoritative sources, and security and privacy requirements.) Of course, demand for even the slightest curation will probably be considered an insurmountable obstacle, with the attendant crying, wailing, and threats of missed deadlines. Hold your ground!

In response, many teams will say, “Fine, you need curation to move this into production. I don’t want to take the time to do that so I’m just going to continue to use it as it is, where it is.” This is where having hard sandbox expiration dates becomes critical. Otherwise, you’re going to have uncontrolled sandbox replication. Not an improvement.

Establish the expiration date when the sandbox is created, and delete it on that date. Automate the process so nobody has to remember to do it, or risk being talked into an extension. You’re going to find out how committed your management is to a rational information architecture the first time somebody wants an extension. And then another. And another. And another.
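Automated expiration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sandbox` record, the 90-day default lifetime, and the `create_sandbox` and `expired` helpers are hypothetical names, and the actual delete operation is left out because it depends on your platform.

```python
from dataclasses import dataclass
from datetime import date, timedelta

DEFAULT_LIFETIME_DAYS = 90  # assumption: use whatever your policy allows


@dataclass(frozen=True)  # frozen: the expiration date cannot be edited later
class Sandbox:
    name: str
    owner: str
    created: date
    expires: date


def create_sandbox(name: str, owner: str, created: date,
                   lifetime_days: int = DEFAULT_LIFETIME_DAYS) -> Sandbox:
    """Fix the expiration date at the moment the sandbox is created."""
    return Sandbox(name, owner, created, created + timedelta(days=lifetime_days))


def expired(sandboxes: list[Sandbox], today: date) -> list[Sandbox]:
    """Return every sandbox whose expiration date has arrived."""
    return [s for s in sandboxes if s.expires <= today]


# Hypothetical registry of sandboxes.
sandbox_registry = [
    create_sandbox("churn_prototype", "marketing", date(2025, 1, 6)),
    create_sandbox("pricing_test", "finance", date(2025, 3, 17)),
]

# A scheduled job would run this daily and pass the result to the
# platform's delete operation (not shown; it is platform specific).
to_delete = expired(sandbox_registry, today=date(2025, 4, 10))
print([s.name for s in to_delete])  # ['churn_prototype']
```

Because the record is immutable and the job runs on a schedule, an extension requires somebody to deliberately create a new sandbox, which is a conversation worth having in the open.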

Sandbox content must also be isolated. Even if the corporate data bloodstream is contaminated with uncontrolled data from everywhere else, you’re just trying to keep it from getting worse. Isolation also prevents these sandboxes from becoming permanent. As soon as one downstream process becomes dependent upon sandbox content, the sandbox becomes infinitely more difficult to expire. That’s often the point. It’s called “burrowing”: Increase dependency to ensure permanence. Recognize that objective and head it off.
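One way to head off burrowing is to scan whatever lineage information you have for non-sandbox consumers that read from a sandbox. The Python sketch below assumes that lineage is available as (source, consumer) pairs and that sandboxes share a `sandbox.` naming prefix. Both are assumptions for illustration, as are all of the object names.

```python
# Lineage edges as (source, consumer) pairs. How you collect them (catalog,
# query logs, scheduler metadata) is outside the scope of this sketch.
lineage = [
    ("warehouse.orders", "sandbox.churn_prototype"),
    ("sandbox.churn_prototype", "prod.weekly_churn_dashboard"),
    ("warehouse.customers", "prod.customer_360"),
    ("sandbox.pricing_test", "sandbox.pricing_test_v2"),
]

SANDBOX_PREFIX = "sandbox."  # assumption: sandboxes share a naming prefix


def burrowing_edges(edges):
    """Return edges where a non-sandbox consumer reads from a sandbox."""
    return [
        (source, consumer)
        for source, consumer in edges
        if source.startswith(SANDBOX_PREFIX)
        and not consumer.startswith(SANDBOX_PREFIX)
    ]


violations = burrowing_edges(lineage)
for source, consumer in violations:
    print(f"{consumer} depends on {source}")
# prod.weekly_churn_dashboard depends on sandbox.churn_prototype
```

This sketch only flags sandbox content leaking outward. Whether one sandbox may read from another is a policy decision it does not address.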

2. Foundational Data Products

Even in an uncontrolled environment, data products can be incrementally introduced to begin to improve consistency. I’ve talked a lot about data products elsewhere so I won’t spend much time here. Start with foundational data products to establish a clean layer of standardized, validated data. Introduce quality measurement to increase reliability. More importantly, begin to demonstrate the benefits of proper management and draw a contrast between the controlled and uncontrolled data.
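Quality measurement does not have to start big. As a minimal sketch, assuming records arrive as Python dictionaries, the hypothetical `completeness` function below reports the fraction of rows with a usable value in each column. The sample records and column names are invented for illustration.

```python
# Invented sample records for a foundational customer data product.
records = [
    {"customer_id": "C001", "country": "US", "signup_date": "2024-02-11"},
    {"customer_id": "C002", "country": "", "signup_date": "2024-03-02"},
    {"customer_id": "C003", "country": "DE", "signup_date": None},
    {"customer_id": "C004", "country": "FR", "signup_date": "2024-05-20"},
]


def completeness(rows, column):
    """Fraction of rows with a non-empty value in the given column."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


scores = {
    column: completeness(records, column)
    for column in ("customer_id", "country", "signup_date")
}
print(scores)
# {'customer_id': 1.0, 'country': 0.75, 'signup_date': 0.75}
```

Publishing even a simple score like this for the managed data, next to the absence of any score for the unmanaged data, helps draw the contrast.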

Next, layer on canonical metrics in composed data products with shared definitions to prevent conflicting numbers. Emphasize to the development teams that:

Everybody can build their own pipelines, but nobody can define their own meaning of data.
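That rule can be expressed as a small shared registry: any pipeline may look up a definition, but a second, conflicting definition of the same term is rejected. The Python sketch below is illustrative only. The `MetricRegistry` class and the `active_customer` definitions are hypothetical, and a real implementation would more likely live in a catalog or semantic layer.

```python
class MetricRegistry:
    """Single place where a metric name is bound to exactly one definition."""

    def __init__(self):
        self._definitions = {}

    def define(self, name: str, definition: str) -> None:
        """Register a definition; reject a conflicting redefinition."""
        existing = self._definitions.get(name)
        if existing is not None and existing != definition:
            raise ValueError(f"'{name}' is already defined as: {existing}")
        self._definitions[name] = definition

    def lookup(self, name: str) -> str:
        """Return the shared definition for a metric name."""
        return self._definitions[name]


metrics = MetricRegistry()
metrics.define(
    "active_customer",
    "customer with at least one paid order in the trailing 90 days",
)

# Any team's pipeline can look the definition up and build on it.
print(metrics.lookup("active_customer"))

# A team attempting to give the term its own meaning is rejected.
try:
    metrics.define("active_customer", "customer who logged in this month")
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
print(conflict_rejected)  # True
```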

Of course, none of this will work with heavy, centralized approval processes. Leverage existing distributed processes as much as possible. Automate as much as possible. Make doing things the right way the easy way.

3. Communication

I’ve said it many, many times:

The best and sometimes only leverage that an enterprise analytics team has is communication.

Identify the intersection of metrics that will resonate and metrics that will drive progress toward more rational management of the analytics environment.

Perhaps display on your website counts of the core, application, and user tables/files as well as the disk space consumed. Make the costs visible. Be sure to include the people each team needs to manage and support its repositories and pipelines. After all, each team may only be consuming a relatively small amount, but it all adds up. Make the list of data tickets/issues/questions public. Quantify the expense associated with data reconciliations and delayed delivery through after-action investigations. Stop wasting money without even knowing it. At least know it.

Now, you’re going to get pushback. Expect it. Be prepared for it. Nobody likes having their sins exposed.

When you start publishing metrics, don’t make it personal. Just give aggregate results. Just show the totals.

Maintain the detail behind the scenes but don’t publicize it. Eventually, somebody’s going to want to see the results by organization or individual. You can reluctantly agree. After all, you don’t want to call anybody out specifically. You just want what’s best for the enterprise. But if they want that information out there, then you will do that for them.
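Keeping the detail while publishing only totals might look like the Python sketch below. The inventory rows, team names, and numbers are invented for illustration, and `published_totals` is a hypothetical helper. The point is that the team column never reaches the published output.

```python
from collections import Counter

# Detail kept behind the scenes: one row per repository (invented numbers).
inventory = [
    {"team": "marketing", "kind": "user", "tables": 420, "storage_tb": 38.0},
    {"team": "finance", "kind": "application", "tables": 130, "storage_tb": 12.5},
    {"team": "enterprise", "kind": "core", "tables": 85, "storage_tb": 64.0},
    {"team": "sales", "kind": "user", "tables": 610, "storage_tb": 51.5},
]


def published_totals(rows):
    """Aggregate by kind of table only; team names are dropped."""
    tables = Counter()
    storage = Counter()
    for row in rows:
        tables[row["kind"]] += row["tables"]
        storage[row["kind"]] += row["storage_tb"]
    return {
        kind: {"tables": tables[kind], "storage_tb": storage[kind]}
        for kind in sorted(tables)
    }


totals = published_totals(inventory)
for kind, figures in totals.items():
    print(kind, figures)
# application {'tables': 130, 'storage_tb': 12.5}
# core {'tables': 85, 'storage_tb': 64.0}
# user {'tables': 1030, 'storage_tb': 89.5}
```

When somebody eventually asks for the breakdown by organization, the same inventory can be grouped by team without collecting anything new.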

Many organizations approach establishing this kind of communication from the other direction, with individuals and departments on Hall of Fame or Hall of Shame rankings. Starting this way increases the probability that you’re going to get shut down right out of the gate by a disgruntled Hall of Shame inductee who complains to management. The larger the organization, the more likely this is to happen.

4. Reward Sharing

This also falls under the heading of communication, but deserves its own bullet point. Most companies incentivize empire-building. Create the new widget and you get kudos and chicken lunches. Leverage something someone else built, delivering in a fraction of the time, and it goes unnoticed.

Do not reward the creation of something new before asking why something existing wouldn’t work.

Incentivize reuse. Incentivize sharing. Incentivize contributing improvements. If the culture rewards speed above all else, you’ll get uncontrolled replication no matter how good the architecture is.

Expose duplication and promote sharing.
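Exposing duplication can start with something as simple as listing the sources that more than one pipeline ingests. The Python sketch below assumes you can assemble a list of (pipeline, source) pairs. All of the names are invented, and `duplicated_sources` is a hypothetical helper.

```python
from collections import defaultdict

# (pipeline, source) pairs; invented names for illustration.
pipelines = [
    ("marketing_orders_feed", "erp.orders"),
    ("finance_orders_extract", "erp.orders"),
    ("sales_orders_copy", "erp.orders"),
    ("finance_gl_extract", "erp.general_ledger"),
]


def duplicated_sources(feeds):
    """Return each source that is ingested by more than one pipeline."""
    by_source = defaultdict(list)
    for pipeline, source in feeds:
        by_source[source].append(pipeline)
    return {
        source: names
        for source, names in by_source.items()
        if len(names) > 1
    }


duplicates = duplicated_sources(pipelines)
for source, names in duplicates.items():
    print(f"{source} is ingested {len(names)} times: {', '.join(names)}")
# erp.orders is ingested 3 times: marketing_orders_feed, finance_orders_extract, sales_orders_copy
```

Each entry in the result is a candidate for sharing: one feed that several teams could reuse instead of maintaining their own.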

All of this requires culture change, and that may be the hardest thing of all to influence. Do what you can (which is what you’re already doing). Set a good example. Decentralized execution requires centralized standards and a culture that enforces discipline. Unfortunately, most organizations don’t have either. In fact, most have the opposite: decentralized standards and no data discipline.

Analytical replication driven purely by demand drifts into chaos.

Teams need autonomy, but within governed systems. Without that governance, you’re wasting money and buying yourself a future reconciliation and consolidation project. Get it right the first time.

Decentralization has been successful in organizations where replication happens at the edges, but the core data is still standardized. Duplication is intentional, not accidental or reflexive. Inconsistency is identified, described, and quantified, and never simply ignored or tolerated.

Data rabbits have been multiplying in most every company for a long time. Start taming them.

Mark Cooper
