Advertisement

How to Architect Data Quality on Snowflake

By on
Read more about author Angsuman Dutta.

Without effective and comprehensive validation, a data warehouse becomes a data swamp. 

With the accelerating adoption of Snowflake as the cloud data warehouse of choice, the need for autonomously validating data has become critical. 

While existing Data Quality solutions provide the ability to validate Snowflake data, these solutions rely on a rule-based approach that is not scalable for hundreds of data assets and are often prone to rules coverage issues. 

Current Approach and Challenges

The current focus in Snowflake data warehouse projects is on data ingestion, the process of moving data from multiple data sources (often of different formats) into a single destination. After data ingestion, data is used and analyzed by business stakeholders – which is where data errors and issues begin to surface. As a result, business confidence in the data hosted in Snowflake reduces. Our research estimates that an average of 20-30% of any analytics and reporting project in Snowflake is spent identifying and fixing data issues. In extreme cases, the project can get abandoned entirely.

Current data validation tools are designed to establish Data Quality rules for one table at a time. As a result, there are significant cost issues in implementing these solutions for hundreds of tables. A table-wise focus often leads to an incomplete set of rules or often not implementing any rules for certain tables, resulting in unmitigated risks. 

In general, data engineering teams experience the following operational challenges while integrating current data validation solutions:

  • Time it takes to analyze data and consult the subject matter experts to determine what rules need to be implemented
  • Implementation of the rules specific to each table. So, the effort is linearly proportional to the number of tables in Snowflake
  • Data needs to be moved from Snowflake to the Data Quality solution, resulting in latency as well as significant security risks
  • Existing tools come with limited audit trail capability. Generating an audit trail of the rule execution results for compliance requirements often takes time and effort from the data engineering team 
  • Maintaining the implemented rules as the data evolves 

Solution Framework

Organizations must consider data validation solutions that, at a minimum, meet the following criteria:

Machine Learning-Enabled: Solutions must leverage AI/ML to: 

  • Identify and codify the data fingerprint for detecting data errors related to Freshness, Completeness, Consistency, Conformity, Uniqueness, and Drift.
  • Effort required for establishing validation checks should not depend on the number of tables. Ideally, the data engineer or data steward should be able to establish validation checks for hundreds of tables with a single click.

In-Situ: Solutions must validate data at the source without the need to move the data to another location to avoid latency and security risks. Ideally, the solution should be powered by Snowflake for performing all the Data Quality analysis. 

Autonomous: Solution must be able to:

  • Establish validation checks autonomously when a new table is created.
  • Update existing validation checks autonomously when the underlying data within a table change. 
  • Perform validation on the incremental data as soon as the data arrives and alert relevant resources when the number of errors becomes unacceptable. 

Scalability: The solution must offer the same level of scalability as the underlying Snowflake platform used for storage and computation. 

Serverless: Solutions must provide a serverless scalable data validation engine. Ideally, the solution must be using Snowflake’s underlying capability.  

Part of the Data Validation Pipeline: The solution must be easily integrated as part of the data pipeline jobs.

Integration and Open API: Solutions must open API integration for easy integration with the enterprise scheduling, workflow, and security systems. 

Audit Trail/Visibility of Results: Solutions must provide an easy-to-navigate audit trail of the validation test results. 

Business Stakeholder Control: Solutions must provide business stakeholders full control of the auto-discovered implemented rules. Business stakeholders should be able to add/modify/deactivate rules without involving data engineers.  

Conclusion

Data is the most valuable asset for modern organizations. Current approaches for validating data, in particular Snowflake, are full of operational challenges leading to trust deficiency and costly, time-consuming methods for fixing data errors. There is an urgent need to adopt a standardized autonomous approach for validating the Snowflake data to prevent the data warehouse from becoming a data swamp. 

Leave a Reply