Testing and Monitoring Data Pipelines: Part One

By Max Lukichev

Suppose you’re in charge of maintaining a large set of data pipelines that move data from cloud storage or streaming sources into a data warehouse. How can you ensure that your data meets expectations after every transformation? That’s where data quality testing comes in. Data testing uses a set of rules to check whether the data conforms to certain requirements.

Data tests can be implemented throughout a data pipeline, from the ingestion point to the destination, but some trade-offs are involved.

On the other hand, there’s data monitoring, a subset of data observability. Instead of writing specific rules to assess whether the data meets your requirements, a data monitoring solution constantly checks predefined metrics of the data throughout your pipeline against acceptable thresholds and alerts you to issues. These metrics can be used to detect problems early on, both manually and algorithmically, without explicitly testing for them.
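The core idea of checking a metric against an acceptable threshold can be sketched in a few lines. This is a minimal illustration, not a real monitoring tool; the function name and the use of daily row counts as the tracked metric are assumptions for the example.

```python
def check_metric(history, current, tolerance=0.5):
    """Alert if the current metric deviates from the historical mean
    by more than `tolerance` (as a fraction of that mean)."""
    baseline = sum(history) / len(history)
    deviation = abs(current - baseline) / baseline
    return deviation > tolerance  # True means "raise an alert"

# Hypothetical daily row counts for one table in the pipeline
row_counts = [10_200, 9_800, 10_050, 9_950]
print(check_metric(row_counts, 9_900))  # False: within the normal range
print(check_metric(row_counts, 3_100))  # True: likely an upstream failure
```

A production system would track many such metrics (row counts, null rates, freshness) per table and learn the thresholds from the data rather than hardcoding them.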

While data testing and data monitoring are both integral parts of the data reliability engineering subfield, they are clearly different.

This article elaborates on the differences between them and digs deeper into how and where you should implement tests and monitors. Part one discusses data testing in detail; part two will focus on data monitoring best practices.

Testing vs. Monitoring Data Pipelines

Data testing is the practice of evaluating a single object, like a value, column, or table, by comparing it to a set of business rules. Because this practice validates the data against data quality requirements, it’s also called data quality testing or functional data testing. There are many dimensions to data quality, but a self-explanatory example of a data test is one that evaluates whether a date field is in the correct format.

In that sense, data tests are deliberate in that they’re implemented with a single, specific goal. By contrast, data monitoring is indeterminate. You can establish a baseline of what’s normal by logging metrics over time. Only when values deviate should you take action and optionally follow up by developing and implementing a test that prevents the data from drifting in the first place.

Data testing is also specific, as a single test validates a data object at one particular point in the data pipeline. On the other hand, monitoring only becomes valuable when it paints a holistic picture of your pipelines. By tracking various metrics in multiple components in a data pipeline over time, data engineers can interpret anomalies in relation to the whole data ecosystem.

Implementing Data Testing

This section elaborates on the implementation of a data test. There are several approaches and some things to consider when choosing one.

Data Testing Approaches

There are three approaches to data testing, summarized below.

Validating the data after a pipeline has run is a cost-effective solution for detecting data quality issues. In this approach, tests don’t run in the intermediate stages of a data pipeline; a test solely checks if the fully processed data matches established business rules.

The second approach is validating data from the data source to the destination, including the final load. This is a time-intensive method of data testing. However, this approach traces any data quality issue back to its root cause.

The third method is a synthesis of the previous two. In this approach, both raw and production data exist in a single data warehouse. Consequently, the data is also transformed in that same technology. This new paradigm, known as ELT, has led to organizations embedding tests directly in their data modeling efforts.

Data Testing Considerations

There are trade-offs you should consider when choosing an approach.

Low Upfront Cost, High Maintenance Cost

Running tests solely at the data destination is the solution with the lowest upfront cost, but it comes with a set of drawbacks that range from tedious to downright disastrous.

First, it’s impossible to detect data quality issues early on, so data pipelines can break when one transformation’s output doesn’t match the next step’s input criteria. Take the example of one transformational step that converts a Unix timestamp to a date while the next step changes the notation from dd/MM/yyyy to yyyy-MM-dd. If the first step produces something erroneous, the second step will fail and most likely throw an error.
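The two steps above can be sketched as follows. The function names are hypothetical; the point is that when step one emits something unexpected, step two fails rather than producing a usable date.

```python
from datetime import datetime, timezone

def to_date(unix_ts):
    # Step 1: convert a Unix timestamp to dd/MM/yyyy
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%d/%m/%Y")

def reformat_date(date_str):
    # Step 2: change the notation from dd/MM/yyyy to yyyy-MM-dd
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%Y-%m-%d")

print(reformat_date(to_date(1_700_000_000)))  # 2023-11-14

# If step 1 erroneously passes the raw timestamp through,
# step 2 fails loudly mid-pipeline:
try:
    reformat_date("1700000000")
except ValueError as exc:
    print("pipeline broke:", exc)
```

A test between the two steps would have caught the malformed value before the pipeline crashed.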

It’s also worth considering that there are no tests to flag the root cause of a data error, as data pipelines are more or less a black box. Consequently, debugging is challenging when something breaks or produces unexpected results.

Another thing to consider is that testing data at the destination may cause performance issues. As data tests query individual tables to validate the data in a data warehouse or lakehouse, they can overload these systems with unnecessary workloads to find a needle in a haystack. This not only brings down the performance and speed of the data warehouse but also can increase its usage costs. 

As you can see, the consequences of not implementing data tests and contingencies throughout a pipeline can affect a data team in various unpleasant ways.

Legacy Stacks, High Complexity

Typically, legacy data warehouse technology (like the prevalent yet outdated OLAP cube) doesn’t scale properly. That’s why many organizations choose to only load aggregated data into it, meaning data gets stored in and processed by many tools. In this architecture, the solution is to set up tests throughout the pipeline in multiple steps, often spanning various technologies and stakeholders. This results in a time-consuming and costly operation.

On the other hand, using a modern cloud-based data warehouse like BigQuery, Snowflake, or Redshift, or a data lakehouse like Delta Lake, could make things much easier. These technologies not only scale storage and computing power independently but also process semi-structured data. As a result, organizations can toss their logs, database dumps, and SaaS tool extracts onto a cloud storage bucket where they sit and wait to be processed, cleaned, and tested inside the data warehouse. 

This ELT approach offers more benefits. First of all, data tests can be configured with a single tool. Second, it gives you the liberty of embedding data tests in the processing code or configuring them in the orchestration tool. Finally, because data tests are so highly centralized, they can be set up in a declarative manner. When upstream changes occur, you don’t need to comb through swaths of code to find the right place to implement new tests. Instead, it’s done by adding a line to a configuration file.
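In dbt, for example, such declarative tests live in a YAML file next to the model code. The sketch below uses dbt’s built-in `unique`, `not_null`, and `accepted_values` tests; the `orders` model and its columns are hypothetical, and the exact YAML keys may vary between dbt versions.

```yaml
# models/schema.yml (dbt convention)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["placed", "shipped", "returned"]
```

Adding a test for a new upstream constraint is then a one-line change to this file rather than a code change.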

Data Testing Tools

There are many ways to set up data tests. A homebrew solution would be to set up exception handling or assertions that check the data for certain properties. However, this approach is neither standardized nor resilient.
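A homebrew solution of the kind described might look like the following sketch, with assertions guarding a load step. The function and field names are made up for illustration.

```python
def load_orders(rows):
    # Homebrew data tests: assert properties before loading
    assert all(r["amount"] >= 0 for r in rows), "negative amount detected"
    assert len({r["order_id"] for r in rows}) == len(rows), "duplicate order_id"
    # ... proceed with the actual load ...
    return len(rows)

rows = [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 0.0}]
print(load_orders(rows))  # 2
```

This works for a handful of checks, but the assertions are scattered through the code, produce inconsistent error reporting, and halt the pipeline on the first failure, which is exactly why standardized tooling emerged.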

That’s why many vendors have come up with scalable solutions, including dbt, Great Expectations, Soda, and Deequ. A brief overview:

  • When you manage a modern data stack, there’s a good chance you’re also using dbt. This community darling, offered as commercial open source, has a built-in test module.
  • A popular tool for implementing tests in Python is Great Expectations. It offers four different ways of implementing out-of-the-box or custom tests. Like dbt, it has an open source and commercial offering.
  • Soda, another commercial open-source tool, comes with testing capabilities that are in line with Great Expectations’ features. The difference is that Soda is a broader data reliability engineering solution that also encompasses data monitoring.
  • When working with Spark, all your data is processed as a Spark DataFrame at some point. Deequ offers a simple way to implement tests and metrics on Spark DataFrames. Best of all, it doesn’t have to process a whole data set when a test reruns: it caches the previous results and updates them incrementally.

Stay tuned for part two, which will highlight data monitoring best practices.