The intersection of data lineage and Data Quality helps provide more accurate and useful information. Data Quality describes how accurate, complete, and reliable data is. Internet businesses need good Data Quality to operate efficiently. Unfortunately, there can be obstacles in gathering, storing, and maintaining high-quality data. Data lineage can help remove those obstacles by providing a history that leads back to the source when there is a problem with the data.
Currently, data is collected from multiple sources and in different formats, such as video, audio, and images, making the use of data lineage even more important for Data Quality.
In modern data stacks, data is stored not only in application databases but also in the applications themselves, as it flows from one application to the next. It can also travel from an application database to a data warehouse, where it is transformed into a standardized format, and then move on to downstream applications or tools for further processing.
The complexity of modern analytics pipelines, the massive amounts of unstructured data, and long runtimes present debugging and manageability challenges that can affect Data Quality.
While the architectural design of an internet business should support the flow of data and allow each system to access the data in the format most appropriate for it, the reformatting process can result in corrupted data files. After the data is taken from its source database, it can undergo a variety of format transformations, each of which adds a layer between the data and its source. These layers can hide or eliminate the data’s traceability. For example, after reformatting, the references for a piece of data may have changed, creating confusion as to whether the data was ever actually collected.
In short, reformatting can corrupt data, make a data file impossible to find, or cause pieces of the data to go missing.
Data Lineage to the Rescue
Data lineage communicates the data’s origin, what has happened to it, and where it has moved since leaving its source. It provides visibility and streamlines the process of tracking errors to their root cause. Data lineage can also support replaying specific portions of a data flow, either to regenerate lost output or to debug.
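To make the idea of tracking an error back to its root cause concrete, here is a minimal sketch in Python. It assumes lineage is stored as a simple mapping from each dataset to the datasets it was derived from. The dataset names and the `trace_upstream` and `root_sources` helpers are hypothetical, chosen only for illustration, and do not come from any particular lineage tool.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
    "orders_raw": [],
    "crm_export": [],
}


def trace_upstream(dataset, lineage):
    """Return every upstream dataset of `dataset`, nearest first (breadth-first)."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def root_sources(dataset, lineage):
    """Return the upstream datasets that have no parents of their own."""
    return [d for d in trace_upstream(dataset, lineage) if not lineage.get(d)]


print(trace_upstream("sales_dashboard", LINEAGE))
print(root_sources("sales_dashboard", LINEAGE))
```

If a number on the dashboard looks wrong, the first list gives the order in which to inspect intermediate datasets, and the second narrows the search to the original sources.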
Data lineage can be a benefit to the entire organization. It provides the visibility and context needed for the effective use of data, and allows the IT team to focus on improvements, rather than manually mapping data. The benefits allow organizations to:
- Save the IT team time
- More easily comply with regulations
- Understand and trust their data
Organizations experiencing Data Quality issues may also want to investigate Data Governance software and/or the concept of data mesh systems.
How Data Lineage Works
After being implemented, data lineage communicates the data’s path visually from source to destination. This includes the changes made along the way and how the data’s representation and parameters change. The issues investigated this way may range from verifying that no personal information about customers is being shared with the wrong people to tracking down a simple, recurring format error.
Data can be debugged by re-running the analytics process through a debugger, but this can become expensive due to the resources and time used, slowing down research and analytics.
Different techniques can be used to collect and document data lineage information. They are not mutually exclusive: an organization can use more than one, depending on the circumstances. Some of the basic techniques are described below:
- Pattern-based lineage: This technique seeks out patterns in the metadata to create a lineage. The primary advantage of this technique is that it does not require any knowledge of programming languages to track the data. Differences in attributes or data values indicate that the data was transformed as it was copied from one system to another. The data transformations and data flows can then be documented as part of data lineage records.
- Lineage through data tagging: By examining the metadata, tags can be attached to data sets, which helps in describing and characterizing them for lineage purposes. Tagging can be done manually or automatically with the appropriate software.
- Lineage by parsing: Data lineage tools can be used to explore data transformation logic, data integration workflows, runtime log files, and other data processing code to identify and extract lineage information. Because the data is monitored as it moves, this technique makes capturing the changes across systems fairly simple. While the parsing technique can be more accurate than the pattern-based technique, it is also a more complicated process.
- Manually implemented lineage: This technique involves interviewing business users, data scientists, BI analysts, data stewards, and others who work with the data about how it moves through various systems and is used and modified. The collected information can be used for mapping out the transformations and data flows. (This is a human process, and very slow compared to automated processes.)
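As an illustration of the pattern-based technique, the sketch below guesses column-level lineage purely from the data, without reading any transformation code. It assumes two small tables represented as lists of dictionaries and links a target column to a source column when most of their distinct values overlap. The function name `infer_column_lineage`, the `min_overlap` threshold, and the sample tables are all hypothetical. Real tools use richer metadata and statistics than this.

```python
def infer_column_lineage(source_rows, target_rows, min_overlap=0.8):
    """Guess which source column each target column was copied from.

    A target column is linked to the source column that shares the largest
    fraction of its distinct values, provided that fraction reaches
    `min_overlap`. Transformation logic is never inspected.
    """
    def column_values(rows):
        # Collect the set of distinct values seen in each column.
        columns = {}
        for row in rows:
            for name, value in row.items():
                columns.setdefault(name, set()).add(value)
        return columns

    source_cols = column_values(source_rows)
    target_cols = column_values(target_rows)

    links = {}
    for t_name, t_values in target_cols.items():
        best_name, best_score = None, 0.0
        for s_name, s_values in source_cols.items():
            score = len(t_values & s_values) / len(t_values)
            if score > best_score:
                best_name, best_score = s_name, score
        if best_score >= min_overlap:
            links[t_name] = best_name
    return links


# Hypothetical sample data: a CRM export and the warehouse table loaded from it.
crm_rows = [
    {"cust_id": 1, "email": "a@example.com", "country": "DE"},
    {"cust_id": 2, "email": "b@example.com", "country": "FR"},
    {"cust_id": 3, "email": "c@example.com", "country": "DE"},
]
warehouse_rows = [
    {"customer_key": 1, "email_address": "a@example.com", "country_name": "Germany"},
    {"customer_key": 2, "email_address": "b@example.com", "country_name": "France"},
    {"customer_key": 3, "email_address": "c@example.com", "country_name": "Germany"},
]

links = infer_column_lineage(crm_rows, warehouse_rows)
print(links)
```

In this sample, `country_name` is left unlinked because its values were rewritten on the way into the warehouse. That gap is the kind of difference that signals a transformation, and it also shows why pattern matching alone can miss lineage that parsing the transformation logic would find.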
Automated Data Lineage
It is not reasonable for growing businesses to manually develop data lineages on a consistent basis. If Data Quality has become an issue, and data lineage is used regularly, an automated system will save time and money.
Automated data lineage can significantly improve the traceability and transparency of data. These automated processes minimize the chance of human error when developing lineages. They also allow less technically skilled staff – not just the IT team – to trace the origins and transformations of data. Automated data lineage tools support the following:
- Collecting data comprehensively: Automated lineage tools can be used to identify data across the organization, allowing the lineages of all the data to be traced.
- Visualized data lineages: Automated tools can display data lineages through user-friendly dashboards.
- Merging Data Governance with data lineage: Integrating Data Governance tools with automated data lineage tools supports enforcing governance policies that work with the lineages.
- Collaboration: Lineage automation tools also come with features for streamlining collaboration between staff, IT teams, and management.
Trustworthy Data Pipelines
Data lineage allows staff and management within the organization to understand and trust their data pipelines. Pipelines are an important part of the data’s history. Lineage is captured at different stages of the data pipeline’s use:
- Data collection: The data flow is tracked during the data gathering process, and is checked for errors during the data transfer, or the mapping between the source and destination systems.
- Data processing: The specific operations performed on the data are tracked. Each stage of data processing is analyzed separately to find any errors or security violations.
- Query history: User queries and automated reports generated by databases, data warehouses, or similar systems are tracked. Because entirely new datasets may be created, it becomes critical to establish a data lineage for important queries and reports. (Queries differ from searches.)
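The three stages above can be sketched as a small pipeline that records one lineage event per step. This is a minimal illustration, assuming an in-memory list as the lineage store. The `track_lineage` decorator, the `LINEAGE_LOG` list, and the step and dataset names are hypothetical and are not the API of any specific lineage product.

```python
from datetime import datetime, timezone
from functools import wraps

LINEAGE_LOG = []  # hypothetical in-memory store; a real system would persist events


def track_lineage(stage, inputs, output):
    """Decorator that appends one lineage event each time the wrapped step runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE_LOG.append({
                "stage": stage,
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@track_lineage("collection", inputs=["orders_api"], output="orders_raw")
def collect_orders():
    # Stand-in for reading from a source system.
    return [{"id": 1, "amount": "19.90"}, {"id": 2, "amount": "5.00"}]


@track_lineage("processing", inputs=["orders_raw"], output="orders_clean")
def clean_orders(rows):
    # Convert the amount from text to a number.
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]


@track_lineage("query", inputs=["orders_clean"], output="revenue_report")
def revenue_report(rows):
    # A query that produces a new, derived dataset (here a single total).
    return sum(r["amount"] for r in rows)


total = revenue_report(clean_orders(collect_orders()))
for event in LINEAGE_LOG:
    print(event["stage"], event["inputs"], "->", event["output"])
```

Recording the event in a decorator keeps the lineage capture separate from the business logic of each step, so steps can be added or changed without rewriting the tracking code.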
The Growing Popularity of Data Lineage
Until recently, data lineage was focused primarily on tracing relationships in data lakes or warehouses. This meant using “data tables” to track relationships. Now, data lineage uses a cross-system approach. (Cross-system lineage maps the use of data from beginning to end at “the system” level: operational systems, data warehouse systems, etc.)
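One way to picture the shift from table-level to system-level lineage is to collapse table-to-table edges into system-to-system edges. The sketch below assumes table names follow a hypothetical `<system>.<table>` convention. The edge list and the `system_of` and `to_system_lineage` helpers are illustrative only.

```python
# Hypothetical table-level lineage edges: (source_table, target_table).
TABLE_EDGES = [
    ("crm.customers", "warehouse.dim_customer"),
    ("shop.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "warehouse.fact_orders"),
    ("warehouse.fact_orders", "bi.revenue_report"),
]


def system_of(table):
    """Assume names look like '<system>.<table>'; return the part before the dot."""
    return table.split(".", 1)[0]


def to_system_lineage(table_edges):
    """Collapse table-level edges into unique system-to-system edges."""
    system_edges = []
    for source, target in table_edges:
        edge = (system_of(source), system_of(target))
        # Skip movement inside a single system and edges already recorded.
        if edge[0] != edge[1] and edge not in system_edges:
            system_edges.append(edge)
    return system_edges


print(to_system_lineage(TABLE_EDGES))
```

The table-level detail is still available when needed, while the collapsed view shows how data moves between operational systems, the warehouse, and reporting tools.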
Data lineage has gained increasing popularity in the last few years. Supporting Data Quality with the incredible amounts of data currently being used is one reason. Another reason is the development of data regulations worldwide with some fairly severe penalties (GDPR, CCPA, and LGPD, for example). Data lineage allows organizations to track, organize, and protect personal data closely.