Could the Data Mesh Solve Your Data Lake Scaling Issues?


Click to learn more about author Mathias Golombek.

Is data mesh architecture the right approach for your organization and its data democratization journey? In my recent blog series, I delved into one of 2021’s hottest data topics – data democratization – exploring how it can fit into a business’ overarching data strategy along with some practical advice on how to implement data democratization in your own organization. 

For today’s follow-up, I’m introducing another contemporary data concept: the data mesh. I’ll explore the link between data democratization and data mesh as a means to connect siloed data and create a self-service data infrastructure that makes data highly available and easily discoverable for the people who need it. 

To be clear, I’m not advocating data mesh as a silver bullet for all the issues people experience with data lakes. It’s a concept that works for some, but not everyone. Ultimately, you’ll need to make up your own mind.

So, let’s get started.

What Is Data Mesh Architecture?

The cloud is one of the most disruptive drivers, if not the most disruptive driver, of radically new Data Architecture approaches. But to fully understand what’s driving the need for data mesh, we need to appreciate the mess many organizations find themselves in when they try to scale their data.

Ananth Packkildurai’s article in Data Engineering Weekly contains a great analogy for the sad state of data infrastructure in many organizations. He likens the modern data generation process to the equivalent of writing a dictionary without any definitions, shuffling the words up randomly and then hiring expensive analysts to try and make sense of it all. While this analogy certainly doesn’t apply to every organization, it definitely resonates – and is at the core of why the data mesh principle has gained such a following over the last few years.  

To write about data mesh and not acknowledge the ground-breaking work of its creator, ThoughtWorks consultant Zhamak Dehghani, would be unforgivable. Her papers, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh and Data Mesh Principles and Logical Architecture, have become required reading on the topic, and I urge you to check them out if you haven’t already.

Why Implement a Data Mesh?

To summarize, Dehghani’s data mesh theory argues that data platforms based on traditional data warehouse or data lake models share common failure modes that prevent them from scaling well. Instead of centralized lakes or warehouses, data mesh advocates a shift to a more decentralized and distributed architecture that fuels a self-serve data infrastructure and treats data as a self-contained product.

Dehghani maintains that as your data lakes grow, so too does the complexity of the Data Management involved. In a traditional lake architecture, you’ve typically got producers of data who generate it and send it into the data lake. However, the data consumers down the line don’t necessarily have the same domain knowledge as the data producer and therefore struggle to understand it. The consumers then have to go back to the data producer to make sense of the data. Depending on whether the producer is a person or a machine, the required level of human domain expertise may or may not be available.

By treating data as a product, data mesh pushes data ownership responsibility to the team with the domain understanding to create, catalog, and store the data. The theory is that doing this at the data creation phase brings more visibility to the data and makes it easier to consume. As well as stopping human knowledge silos from forming, it helps to truly democratize the data because data consumers don’t have to worry about data discovery and can focus on experimentation, innovation, and producing more value from the data.
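
To make the “data as a product” idea a little more concrete, here is a minimal Python sketch of a domain team publishing a documented data set to a shared catalog. This is my own illustration rather than anything taken from Dehghani’s papers: the names DataProduct, Catalog, register, and search, as well as the example “orders_daily” data set and its “checkout” owner, are all hypothetical, and a real data mesh platform would look quite different.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes alongside its data."""
    name: str
    owner_team: str          # the domain team accountable for the data
    description: str         # plain-language meaning, written by the producer
    schema: dict[str, str]   # column name -> documented meaning
    tags: frozenset[str] = frozenset()


class Catalog:
    """Hypothetical in-memory catalog that consumers search for discovery."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Reject columns without documentation: the producing team holds the
        # domain knowledge, so it is asked to write it down at creation time.
        missing = [col for col, doc in product.schema.items() if not doc.strip()]
        if missing:
            raise ValueError(f"undocumented columns: {missing}")
        self._products[product.name] = product

    def search(self, tag: str) -> list[DataProduct]:
        # Consumers discover data by tag instead of asking the producer.
        return [p for p in self._products.values() if tag in p.tags]


# Example usage with made-up values.
catalog = Catalog()
catalog.register(DataProduct(
    name="orders_daily",
    owner_team="checkout",
    description="One row per confirmed order, refreshed daily.",
    schema={
        "order_id": "Unique order identifier",
        "total_eur": "Order total in euros, including tax",
    },
    tags=frozenset({"sales"}),
))

for product in catalog.search("sales"):
    print(product.name, "owned by", product.owner_team)
```

The point of the sketch is the division of responsibility: the producing team documents and registers the data once, and every consumer can then find and interpret it without going back to the producer.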

Approach the Data Mesh with Caution 

That’s the theory, anyway. Despite data mesh architecture gaining a lot of traction, there are concerns in the industry about its application. And, of course, there are plenty of strong advocates for the benefits of data warehouses and lakes. Going a stage further, my colleague Helena Schwenk recently blogged on the new concept of the data “lakehouse” as a means of increasing the flexibility of modern data infrastructures.

As I said at the start, data mesh isn’t a panacea. But if you do go down this route, getting your tech stack right – or as right as possible – will be crucial to data mesh efforts. You need a very powerful central system that can handle all this diverse access.

Learning from the Pioneers 

If you’re looking to implement the data mesh architecture, let me share a few examples of companies you can learn from, who’ve been very open and transparent about their journeys. 

Netflix processes trillions of events and petabytes of data a day. As it has scaled up original productions, data integration across the streaming service and the studio has become a priority. So Netflix turned to data mesh as a way to integrate data across hundreds of different data stores in a way that enables it to holistically optimize cost, performance, and operational concerns. This great YouTube video explains more.

Europe’s biggest online fashion retailer – and Exasol customer – Zalando has also been on a journey from a centralized data lake towards embracing a distributed data mesh architecture. Here’s another great YouTube video from NDC Oslo where Max Schultze outlines Zalando’s ongoing efforts to make the creation of data products simple.

What do you think about data mesh architecture? Share your ideas in the discussion section below!
