By Einat Orr.
Data engineering is the science and art of producing good, timely data. Its goal is delivering data to users, even more than delivering applications. We have excellent methods and tools for delivering applications with consistently high quality. What are the methods and tools that help us deliver high-quality data?
In this article, I will take the following three concepts: development environments, continuous integration, and continuous deployment, and demonstrate what they should look like in the world of data delivery. I will also provide examples of tools that can help you build this foundation for your data application.
What Is a Development Environment for Data?
When developing data-intensive applications, we need to experiment with new code, new data sets, changes to existing code or data, and changes to our data analysis tools: for example, a new ETL, a format or schema change, a new compression algorithm, an accuracy improvement, a Spark/Presto version upgrade, and so on.
While the type of experiment varies, the need remains the same: We should be able to run isolated experiments on our data pipelines in an environment similar to production, without fear of compromising it.
Let’s assume we could manage our data lake the way we manage our code repository. Version control for data allows Git-like operations over big data repositories: branches, commits, merges, and hooks. Tools that provide this capability include DVC, which focuses on ML pipeline use cases and is designed for human scale, and lakeFS, a general-purpose version control layer that supports machine scale. It is important that these operations be cost-effective and avoid copying data: all Git-like actions should be metadata operations, and hence immediate and atomic.
Creating a development environment is easy and can prevent costly mistakes in production. By creating a branch of our production data, we get an isolated data environment representing a snapshot of our repository. Changes made to the master branch after the branch was created are not visible within the branch unless we explicitly merge them in.
While working on our branch in isolation, our changes are not visible to other users working on the repository’s master branch. To sum up, a branch provides us with our own private data lake to experiment on.
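To make this concrete, here is a minimal, illustrative sketch of branching as a metadata-only operation. The Repo class and its methods are invented for illustration; they are not a real DVC or lakeFS API.

```python
# Toy model of metadata-only branching over an object store.
# A branch is just a mapping of paths to object IDs; creating a
# branch copies pointers, never the data objects themselves.

class Repo:
    def __init__(self):
        self.objects = {}              # content-addressed object storage
        self.branches = {"master": {}} # branch name -> {path: object_id}

    def put(self, branch, path, data):
        oid = hash(data)
        self.objects[oid] = data
        self.branches[branch][path] = oid

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source):
        # O(1) in data size: only the path->object pointers are copied
        self.branches[name] = dict(self.branches[source])

repo = Repo()
repo.put("master", "events/2021-01.parquet", "raw event data")
repo.create_branch("experiment", "master")
repo.put("experiment", "events/2021-01.parquet", "reprocessed data")

# master is unaffected by changes made on the experiment branch
print(repo.get("master", "events/2021-01.parquet"))      # raw event data
print(repo.get("experiment", "events/2021-01.parquet"))  # reprocessed data
```

Because only pointers are duplicated, creating the isolated environment is immediate regardless of how much data the repository holds.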
Consider, for example, an experiment of upgrading a version of Apache Spark. For this purpose, we create a branch that will only be used to test the Spark upgrade and will be discarded later. Jobs may run smoothly (the theoretical possibility exists!), or they may fail halfway through, leaving us with intermediate partitions, data, and metadata. In this case, we can simply revert the branch to its original state, without worrying about the intermediate results of our last experiment, and perform another (hopefully successful) test on the isolated branch. Assuming revert actions are atomic and immediate, no manual cleanup is required.
Once testing is completed and we have achieved the desired result, we can delete this experimental branch, and all of our data changes will be gone.
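A toy sketch of why the revert is cheap, assuming the branch keeps a pointer to the snapshot it started from. The Branch class below is hypothetical, invented for illustration.

```python
# Illustrative model: reverting an experimental branch is a pointer
# reset back to its creation point, not a cleanup of physical files.

class Branch:
    def __init__(self, snapshot):
        self.base = dict(snapshot)   # snapshot the branch started from
        self.state = dict(snapshot)  # current working state

    def write(self, path, data):
        self.state[path] = data

    def revert(self):
        # atomic and metadata-only: all intermediate partitions,
        # data, and metadata vanish in one step
        self.state = dict(self.base)

prod = {"sales/part-0": "v1"}
exp = Branch(prod)                            # branch for the Spark upgrade test
exp.write("sales/part-1", "partial output")   # job failed halfway through
exp.revert()                                  # instantly back to a clean state
assert exp.state == prod
```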
What Is Continuous Integration of Data?
When introducing new data sets to the data lake, we must verify they adhere to the engineering and quality requirements we expect: format, schema, data range, PII governance, and so on. In applications where consuming new data is routine, continuously integrating new data into our data lake is a basic need, just as we continuously integrate new code into our codebase. Continuous data integration is the automatic and safe ingestion of data into our data lake while ensuring we meet Data Quality requirements.
On top of the Git-like operations, we also have testing frameworks for data that allow us to easily define metadata validations and Data Quality tests. In this category, you can find startups like Monte Carlo, Great Expectations, and Mona that perform tests on data in different phases of the data application life cycle. Tests may enforce engineering best practices, such as schema or format, or use sophisticated ML-based checks to find anomalies relative to the behavior of your data observed so far.
A good practice would be to ingest the data to an isolated branch without our consumers being aware of it.
We now define a set of pre-merge hooks that trigger our data validation tests. Only after the tests pass is the data merged to the lake’s master branch and exposed to consumers. If a test fails, we alert the writer with the relevant validation failure. In this manner, we achieve high-quality ingestion of data with atomic, Git-like operations, freeing us from worrying about leftover state to clean up.
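Here is a hedged sketch of the idea: a hypothetical pre-merge hook that runs illustrative schema and range checks before staged data is exposed. None of these function names come from a real testing framework.

```python
# Illustrative pre-merge hook: staged data is merged to the master
# branch only if every validation passes. The checks below are
# hypothetical examples of the engineering tests described above.

def validate_schema(record, required):
    # engineering check: all required fields must be present
    return required.issubset(record)

def validate_range(record):
    # quality check: price must be non-negative
    return record.get("price", -1) >= 0

def pre_merge_hook(records):
    checks = [lambda r: validate_schema(r, {"id", "price"}), validate_range]
    return [r for r in records if not all(check(r) for check in checks)]

def merge_to_master(master, staged):
    failures = pre_merge_hook(staged)
    if failures:
        # alert the writer; nothing is exposed to consumers
        raise ValueError(f"validation failed for {len(failures)} record(s)")
    master.extend(staged)  # atomic exposure to consumers

master = []
merge_to_master(master, [{"id": 1, "price": 9.99}])  # passes, data exposed
try:
    merge_to_master(master, [{"id": 2}])             # missing price field
except ValueError as err:
    print("merge rejected:", err)
```

The key property is that a failed test leaves the master branch untouched: the bad records stay on the ingestion branch, where they can be inspected or discarded.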
What Is Continuous Deployment of Data?
In data production environments, data streams in, time-based orchestrated jobs run, and existing data sets are updated with the freshest data. Even when the code and environment don’t change, the data is dynamic and constantly changing; new data representing the present fuels the application and enables the delivery of up-to-date insights. In other words, data is continuously deployed into production. Continuous deployment of data is a process that allows data validation and quality assurance before the data reaches production, where it is consumed by internal or external customers.
Let’s assume that, on top of the Git-like operations and testing frameworks for data, we have orchestration tools that automate the execution of complex data operations by running a graph of small data analysis jobs. Orchestration is key to the continuous deployment of data to production; check out Apache Airflow, Luigi, Dagster, and Prefect.
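As a toy illustration of what an orchestrator does, the sketch below executes a small graph of jobs in dependency order using Python’s standard-library graphlib. The pipeline stages and job bodies are invented examples; real tools like Airflow or Dagster add scheduling, retries, and monitoring on top of this core idea.

```python
# Minimal DAG execution: each job runs only after all of its
# dependencies have produced results.
from graphlib import TopologicalSorter

def run_pipeline(dag, jobs):
    order = TopologicalSorter(dag).static_order()  # dependency order
    results = {}
    for task in order:
        results[task] = jobs[task](results)  # each job reads upstream results
    return results

# task -> set of tasks it depends on
dag = {"validate": {"ingest"}, "transform": {"validate"}, "publish": {"transform"}}
jobs = {
    "ingest":    lambda r: [1, 2, 3],
    "validate":  lambda r: [x for x in r["ingest"] if x > 0],
    "transform": lambda r: [x * 10 for x in r["validate"]],
    "publish":   lambda r: sum(r["transform"]),
}
print(run_pipeline(dag, jobs)["publish"])  # 60
```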
Now running continuous deployment for data is easy:
- Instantly revert changes to data: If low-quality data is exposed to our consumers, we can revert instantly to a former, consistent, and correct snapshot of our data lake. Since commit history is available for a configurable duration, we can roll the lake back to the previous version with one atomic action.
- Prevent Data Quality issues by enabling:
- Testing production data in the isolation of a branch before delivering it to production users/consumers, using hooks in the orchestration system to run the pipeline on the isolated branch.
- Testing intermediate results of our DAG in isolation to avoid cascading quality issues, and easily managing the retention of such intermediate results using branch retention logic.
- Enforce cross-collection consistency: Expose to consumers several collections of data that must stay synchronized, in one atomic, revertible action. Using branches, writers can provide consistency guarantees across different logical collections: merging to the main branch happens only after all relevant data sets have been created successfully.
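The cross-collection guarantee and the instant rollback can be sketched together in a simplified model, where consumers always read the latest commit and every commit captures all collections at once. This is an invented illustration, not a real API.

```python
# Simplified model: "main" is a history of immutable snapshots.
# A commit captures all collections together, so related data sets
# never appear half-updated; rollback is one pointer move.

def commit(history, **collections):
    history.append(dict(collections))  # atomic: all collections at once

def latest(history):
    return history[-1]                 # what consumers read

def rollback(history):
    history.pop()                      # revert to the previous snapshot

main = []
commit(main, orders=["o1"], customers=["c1"])
commit(main, orders=["o1", "bad"], customers=["c1"])  # low-quality deploy
rollback(main)                                        # one atomic action
assert latest(main) == {"orders": ["o1"], "customers": ["c1"]}
```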
We have all the tools we need to make our big data environment resilient and to manage new data and day-to-day production data while ensuring high-quality delivery. It’s up to us to be open to new methodologies and supporting technologies, and to set up a world-class modern data infrastructure.