
Distributed Data Architecture Patterns Explained


Distributed data architecture – models that use multiple platforms and processes to meet data-driven goals – continues to generate increasing interest. As William McKnight, president of McKnight Consulting Group (MCG) and well-known data architecture advisor, says, “Seldom does a database vendor not interact with concepts around distributed data architectures: the data lakehouse, data mesh, data fabric, and data cloud – and I am sure you find it true for your interactions.”

In a recent Advanced Analytics (ADV) webinar, McKnight explained how to choose among distributed data architecture patterns to meet business goals. He provided a high-level overview of each option and the steps needed to implement it. Most importantly, he helped his audience understand why to consider a distributed data architecture and which combination would work best in their business environment.

Why Consider Distributed Data Architecture Patterns

Options for distributed data architectures came about in response to the advantages and limitations of monolithic, centralized models. From the 1980s through the 2000s, organizations turned first to data warehouses – structured business information stores for all enterprise data – to process keyed inputs.

Later, in the 2010s, raw streaming data from applications such as social media required a different data configuration. As a result, data lakes emerged to ingest data in a variety of formats and store it cheaply.

While data lakes provide flexibility unmet by data warehouses, they lack the warehouses’ advantages. McKnight stated, “With data warehouses, you can have transactions if you want and enforce great Data Quality.”

Companies now want the best of both warehouses and lakes to meet the realities of their goals. 

These business requirements include aligning different operational systems in different ways to promote data sharing, with “adherence to domain-specific boundaries and certain business areas,” explained McKnight.

For example, different departments at the same bank can use common information about a customer who holds both a checking account and a credit card there. Meanwhile, each office sees only the data it needs to process its transactions, stay compliant with regulations, and protect customer privacy. A good combination of distributed data architecture patterns satisfies both needs.
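As a minimal sketch of that idea, the following hypothetical Python code gives each department a domain-scoped view of a shared customer record; the Customer class and the view functions are illustrative only, not from the webinar:

    # A minimal sketch of domain-scoped views over shared customer data.
    # Customer, checking_view, and cards_view are hypothetical names.
    from dataclasses import dataclass

    @dataclass
    class Customer:
        customer_id: str
        name: str
        checking_balance: float
        card_limit: float

    def checking_view(c: Customer) -> dict:
        # The checking department sees only checking-related fields.
        return {"customer_id": c.customer_id, "name": c.name,
                "balance": c.checking_balance}

    def cards_view(c: Customer) -> dict:
        # The credit card department sees only card-related fields.
        return {"customer_id": c.customer_id, "name": c.name,
                "limit": c.card_limit}

Both views draw on one shared record, so the departments stay consistent while each sees only what it needs.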

How to Evaluate Distributed Data Architecture Combinations

According to McKnight, a business should keep its priorities front and center rather than fixating on a single distributed data architecture configuration. That includes not getting mired in the technical commonalities among the patterns.

Instead, think of each distributed data architecture pattern as guidance – a set of validated, tried-and-true ideas. When applying this information, see each blueprint as part of a combination rather than a one-size-fits-all solution, as the diagram below shows:

[Diagram: data architecture patterns. Image source: MCG]

The best synergy “depends on factors like where an organization comes from, the technologies and architecture it has implemented, and the skills in constructing the architecture,” advised McKnight. 

Pull ideas from each pattern and take the time needed to adhere to them, he said, before choosing among distributed data architecture models. Also, put a solid data foundation – e.g., a standardized Data Quality framework – behind these architectures when implementing them.
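As one illustration of what such a foundation could include, here is a minimal sketch of a standardized, reusable Data Quality check; the function and sample records are hypothetical:

    # A minimal sketch of a standardized Data Quality check that could sit
    # beneath any of these architectures. All names are hypothetical.
    def check_not_null(rows, column):
        """Return the rows where a required column is missing."""
        return [r for r in rows if r.get(column) is None]

    records = [{"customer_id": "c1"}, {"customer_id": None}]
    violations = check_not_null(records, "customer_id")
    assert len(violations) == 1  # one record fails the required-field rule

Standardizing checks like this one means every architecture pattern built on top inherits the same quality guarantees.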

Distributed Data Architecture Patterns

Distributed data architecture patterns include the data lakehouse, data mesh, data fabric, and data cloud. Each is described below.

Data Lakehouse

The data lakehouse, a term coined by Databricks, refers to a combination of a data lake and a data warehouse. It emerged as an entry point into distributed architecture patterns, noted McKnight, and has generated the most discussion.

He explained that while various vendors have coined different terms, they are all essentially describing the data lakehouse concept. McKnight added:

“All major vendors have converged their messaging around the concept of the lakehouse architecture. They take the best attributes of a data warehouse and enable them to run on data lake storage, specifically cloud storage. Users query the data warehouse, which applies smart programming to reach through, drill into, and get data from the data lake. These algorithms run previously unexecuted queries against the data lake.”

A data lakehouse provides organizations with a unified data platform, streamlining their overall data management processes. This setup lets end users quickly get the data they need in the format they need. Moreover, the data lakehouse offers flexible storage that scales and supports both streaming and batch processing.
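As a concrete illustration, here is a minimal lakehouse sketch using PySpark with Delta Lake – one common implementation, given that Databricks coined the term. It assumes the pyspark and delta-spark packages are installed with the Delta jars on the classpath, and the storage path is a placeholder:

    # A minimal lakehouse sketch: warehouse-style table semantics
    # (ACID writes, schema enforcement) over cheap data lake storage.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lakehouse-sketch")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Append transactional records to a Delta table on lake storage.
    df = spark.createDataFrame(
        [(1, "deposit", 250.0)], ["txn_id", "txn_type", "amount"]
    )
    df.write.format("delta").mode("append").save("/tmp/lake/transactions")

    # Queries reach through the metadata layer to the files on the lake.
    spark.read.format("delta").load("/tmp/lake/transactions") \
        .createOrReplaceTempView("transactions")
    spark.sql(
        "SELECT txn_type, SUM(amount) FROM transactions GROUP BY txn_type"
    ).show()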

According to McKnight, while data lakehouses feature metadata layers between the warehouse and the lake to handle the drill-through paths, they have some drawbacks. For example, he explained that lakehouses have difficulty mixing appends and reads, which users need in order to transform and retrieve data at the same time.

The technology also has challenges combining batch and streaming workloads. However, the savings on administration and standardization make the data lakehouse a prime candidate as a distributed data architecture option.

Data Mesh

The data mesh architectural pattern acknowledges that organizations will have multiple data warehouses and lakes, and it recommends four core principles: domain-oriented ownership, data as a product, a self-serve data platform, and federated computational governance. This methodology focuses on context and “decentralizes and decouples architectural elements, by domain,” stated McKnight.

He compared data mesh construction to a microservices approach in software development, where each domain functions independently but must work with other business areas to deliver the organization’s overall product or service. Companies typically work out their domain structures through conceptual data modeling.
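To make the data-as-a-product principle concrete, here is a minimal sketch of the kind of contract a domain team might publish for its data product; the DataProduct class and every field name are hypothetical, not from the webinar:

    # A minimal sketch of a data product "contract" in a data mesh.
    # All names here are hypothetical, for illustration only.
    from dataclasses import dataclass

    @dataclass
    class DataProduct:
        name: str                   # product name consumers discover
        owner_domain: str           # the domain team accountable for it
        schema: dict                # the published output schema
        sla_freshness_minutes: int  # how stale consumers can tolerate
        endpoint: str               # where consumers read the product

    settled_payments = DataProduct(
        name="settled_payments",
        owner_domain="payments",
        schema={"payment_id": "string", "amount": "decimal",
                "settled_at": "timestamp"},
        sla_freshness_minutes=60,
        endpoint="warehouse://payments/settled_payments",
    )

The point of the contract is that the owning domain, not a central team, is accountable for the product’s schema, freshness, and access point.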

Data mesh has advantages that entice organizations. They include data democratization, cost efficiencies, and “reducing data silos and operational bottlenecks,” said McKnight. Furthermore, the data mesh concept supports good security and compliance, self-service applications, BI dashboards, personalized experiences, and machine learning (ML) projects.

While conceptually simple, a data mesh requires multiple data warehouses, lakes, and ingestion layers, which can increase technical complexity. Additionally, it requires solid construction of domains and their Master Data Management (MDM) to work.

Data Fabric

Data fabrics use intelligent, automated algorithms to unify disparate data across systems, provide access to integrated enterprise data, and scale more easily as organizations grow. McKnight likened the data fabric architectural pattern to data virtualization, a data integration technology that provides access to data in real time.

McKnight observed that no matter which data model defines an organization’s architecture – e.g., data lakehouse, data mesh, or data cloud – the data fabric plays a role in providing common shared services and application portability. Metadata drives these benefits by giving AI-driven systems and human analysts access to data everywhere.
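A minimal sketch of that metadata-driven access might look like the following: a catalog maps logical dataset names to physical locations, so consumers never hard-code where the data lives. All names and URIs here are hypothetical:

    # A minimal sketch of metadata-driven access in a data fabric.
    # The catalog and its URIs are hypothetical, for illustration only.
    CATALOG = {
        "customers": {"system": "warehouse", "uri": "wh://crm/customers"},
        "clickstream": {"system": "lake", "uri": "s3://lake/raw/clicks/"},
    }

    def resolve(dataset: str) -> str:
        """Return the physical location for a logical dataset name."""
        entry = CATALOG.get(dataset)
        if entry is None:
            raise KeyError(f"unknown dataset: {dataset}")
        return entry["uri"]

    # The same call works whether the data lives in a warehouse or a lake.
    print(resolve("customers"))    # wh://crm/customers
    print(resolve("clickstream"))  # s3://lake/raw/clicks/

Because consumers go through the catalog, the fabric can move or re-platform data without breaking the applications that use it.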

Organizations choose a data fabric architecture for its ML support, data democratization, and consistency in applying data security rules. Additionally, the data fabric shines in fraud detection, predictive maintenance, customer profiling, and risk modeling.

Consider MDM, as advised earlier for data mesh, when evaluating a data fabric. Such an architectural component provides the Data Quality necessary to make integration within a data fabric feasible.

Data Cloud

McKnight described the data cloud as a newer distributed data architectural concept – the “fourth leg holding the table” – and the evolution of an organization’s data architecture. He acknowledged that this term has emerged recently and is tied somewhat to the vendor Snowflake.

McKnight defines the data cloud more broadly than Snowflake does. He likened it to a data marketplace that provides live access to query data with a few clicks.

Such a setup allows an organization to share and exchange data with subsidiaries, partners, third parties, or general users on the internet. Multiple interoperable clouds underlie the data cloud architecture, connecting syndicated data and data for AI algorithms across organizations.
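Here is a minimal sketch of that sharing model, in which a provider grants a consumer live, read-only access to a data product rather than shipping copies of it; the DataShare class is hypothetical, not any vendor’s API:

    # A minimal sketch of data-cloud-style sharing: consumers get live,
    # governed access to the provider's single copy of the data.
    # The DataShare class is hypothetical, for illustration only.
    from dataclasses import dataclass, field

    @dataclass
    class DataShare:
        product: str
        grants: set = field(default_factory=set)

        def grant(self, consumer: str) -> None:
            # Live access: the consumer queries the provider's copy;
            # nothing is physically duplicated or transmitted in bulk.
            self.grants.add(consumer)

        def can_query(self, consumer: str) -> bool:
            return consumer in self.grants

    share = DataShare(product="settled_payments")
    share.grant("partner_bank")
    assert share.can_query("partner_bank")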

McKnight indicated that this concept of a data cloud is an emerging distributed data architecture. But as enterprises utilize and monetize their data, they will develop new ideas and possibilities for data products. Over the next few years, he expects people to use and work with data products in this data cloud.

Conclusion

Distributed architecture patterns promise combinations of architectural components for more efficient data processing, better data sharing, and cost savings. McKnight summarized the advantages of each as follows:

  • Data lakehouse: Drill-through pathing so the end-user can easily access the data they need
  • Data mesh: Decentralized and decoupled architectural parts according to context 
  • Data fabric: Connectivity that provides common shared services and application portability, making automation possible by applying metadata patterns
  • Data cloud: A unified, single copy of an organization’s data, together with the external data it shares with outside customers

McKnight concluded by emphasizing that the best architectural implementations help the organization thrive. His final advice was, “Meet your business goals with whatever architecture you implement. You want to end up with one that is right for you.” 

Watch the Advanced Analytics webinar here:

 

Image used under license from Shutterstock.com