Mind the Gap: Data Trust but Data Verify

Mark Cooper

We seem to have an irresistible reflex to prepend “data” to every noun, verb, and adjective in the dictionary. In this column I’m going to talk about two more: data trust and data contract.

I predict that through 2028, 80% of S&P 1200 organizations will become curious about the meaning of data trust. Why? Because among The Gartner 100 Data, Analytics and AI Predictions Through 2031 was:

Through 2028, 80% of S&P 1200 organizations will relaunch a modern, data & analytics (D&A) governance program, based around a trust model.

That’s a pretty astonishing prediction. Let’s break it down.

Through 2028. Probably not coincidentally, that’s the same year by which Gartner expects both significant adoption of generative AI and AI agents as well as concomitant AI-related risks and mishaps. The relationship between data quality and AI quality is already well known, but the support for data quality initiatives is still too often half-hearted. Within three years, companies will either be enjoying the benefits of their quality efforts, watching others benefit from theirs, or recovering from some misfortune caused by the lack of them.

80% of S&P 1200 organizations. That’s a lot. The broadest of the predictions presented. I guess they can’t say “everyone,” but that’s pretty much everyone. This probably means you.

Will relaunch a modern, D&A governance program. Not launch. Relaunch. It’s an interesting choice of words. It’s not that there aren’t existing D&A governance programs. It’s not even that there haven’t been successes with existing D&A governance programs. To the contrary, there have been lots. It’s just that what we’re doing is working just well enough to give management the illusion of adequacy. It will become clear that these efforts will not be sufficient to support AI. Hence, “relaunch.”

Based around a trust model. This part caught my attention. Since it seems that most of us are going to be doing it within the next couple of years, it would probably be useful to take a look at trust models in more detail.

A trust model is a framework for establishing confidence in a relationship.

It’s a very common approach used to diagnose and strengthen trust dynamics between leaders and their teams. It seems that every management consultant and consultancy has its own model: the Trust Equation, ABC Model, Nine Habits of Trust, 3 Cs of Trust, and many, many more. (Maybe it’s just the era when I grew up, but to me that list looks like the tracks on Side 1 of a Sesame Street album.) Generally speaking, they emphasize variations and combinations of competency, integrity, reliability, empathy, communication, and accountability.

Trust models have also been applied to cryptography and computer security. Frameworks include Public Key Infrastructure and Zero Trust. Their purpose is to help protect sensitive data and systems from cyber threats. The idea is to never trust. Always verify. Deny access by default. Actually, these seem more like “lack of trust” models.

I talked about the Zero Trust concept as applied to data several months ago. A Zero Trust Data Content strategy assumes that, in the absence of sufficient evidence to the contrary, data is incorrect.

Here, it is being applied more broadly to data and analytics, not just data content.

From a D&A perspective, data trust means having confidence in the accuracy, reliability, timeliness, and security of your data.

I’m not sure that qualifies as revolutionary. I’m not sure that it even requires new vocabulary. I guess it says something about the state of our art that it appears to be necessary. We’ve been kind of underwhelming when it comes to data governance and data quality. When it comes to data trust, maybe a reset is necessary.

At this point most companies understand that the success of AI initiatives strongly depends upon the quality of the model training data. Understanding is a great first step, but what about the doing? That’s the hard part. That’s the part that requires that something different be done.

Just as there are many trust model frameworks for management coaching and for data protection, there are many trust model frameworks for data and analytics. It feels like this in so many areas right now. Every vendor (existing and new) is trying to get something out into the market as quickly as possible, hoping that it will become the standard (or at least be used enough to justify their investment in it).

The framework from decube.io is typical. The key to trust is integration and transparency, and it rests on four pillars:

  • Metadata Management
  • Data Governance
  • Data Quality with Observability
  • Data Mesh

Look familiar? You don’t have to squint very hard to see data products. After all, data mesh rests on a foundation of data products. But I’ll go one step further:

Pursuing data products is the fastest, most reliable, and most practical way to increase data trust.

It doesn’t matter whether it’s part of a data mesh (or data fabric) architecture.

It shouldn’t be a surprise. What’s the “product” in data product? If you’ve been following along here the last 18 months, say it together: “The ‘product’ in data product is reliability.” Reliability builds trust.

Data trust is an outcome, not a program or initiative.

I suspect that Gartner was using the term in their prediction more as an umbrella encompassing relaunched data governance efforts with components similar to the four listed above. Increased trust in the data is the measurable result.

Sometimes I think our vocabulary choices are more aspirational than functional.

We want to build data trust. We can take steps to build data trust. We can measure the level of trust in the data. We certainly don’t want to lose trust, because when we do it’s really, really hard to get back. So, yes, data trust is probably a useful metric. It’s sort of like consumer confidence.

So, if using a different wrapper helps to sell the concept to management, then fine. Activities under the heading of “data trust” are those things that you know you should be doing anyway. Don’t get distracted by the hype and the buzz, but instead use the hype and the buzz as drivers to do them.

Which brings us to data contracts.

My first thought, when I heard the term, was that we’re creating new vocabulary to describe something we should already have been doing, and presenting it as if it were something new.

Perhaps. Or perhaps not.

If you’re considering implementing a data fabric, especially using data products, it won’t be too long before you run across data contracts.

Data contracts are agreements between data producers and data consumers that specify data structure, semantics, responsibilities, and service-level commitments.

We’ve certainly seen something like this before. After all, an interface agreement is required for any automated data exchange, whether internally within a company or externally between companies. Standard APIs or specifications are used in finance (SWIFT), healthcare (FHIR), and retail (EDI), to name just a few.

Sounds like we’re just repackaging again.

Let’s look more closely at what a data contract contains. Most of the items are self-evident (a rough sketch follows the list). A data contract includes:

  • Metadata for the dataset itself
  • Schema definition
  • Metadata for the data elements
  • Operational expectations / SLAs
  • Quality and validation rules
  • Access and security controls
  • Consumer obligations

Again, do these look familiar?

My second thought was that this looks an awful lot like the key requirements for a data product: data, metadata, lifecycle management, and support.

That would explain why data contracts are often used to accumulate the details required for creating, executing, maintaining, and supporting data products.

The information in a data contract is captured in a structured, easily interpreted format, although there’s not yet widespread agreement on what that format should be. Some use JSON. Others YAML or XML. And still others use a variety of lesser-known or proprietary formats. This lack of agreement turns out to be a recurring challenge when it comes to data contracts.

I’ve long said that metadata should flow from source to target like the data itself. Define the descriptions, expected content, and so forth once at the System of Record and propagate them to wherever that data is used. Don’t create everything from scratch at each destination.

Data contracts enable that vision.

The schema, content, support, maintenance, SLAs, and so forth can be validated against the contract. And much of that validation can be automated.
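As a rough illustration of what that automation could look like, the sketch below checks a batch of records against the hypothetical customer_orders_contract dictionary sketched earlier. The rule names and contract structure are my own assumptions, not a standard; real observability and testing tools do this far more thoroughly.

```python
def validate_against_contract(records: list[dict], contract: dict) -> list[str]:
    """Return a list of violations found when checking records against a contract (sketch only)."""
    violations = []

    # Schema commitment: every required column must be present and non-null in every record.
    required = [col["name"] for col in contract["schema"] if col.get("required")]
    for i, record in enumerate(records):
        for col in required:
            if record.get(col) is None:
                violations.append(f"row {i}: required column '{col}' is missing or null")

    # Quality rules: uniqueness, as one example of a contract-driven check.
    for rule in contract.get("quality_rules", []):
        if rule["rule"] == "unique":
            for col in rule["columns"]:
                values = [r.get(col) for r in records]
                if len(values) != len(set(values)):
                    violations.append(f"column '{col}' contains duplicate values")

    return violations


# Hypothetical usage: the second row violates both the not-null commitment and uniqueness.
sample = [
    {"order_id": "A-1", "customer_id": "C-9", "order_total": 42.50, "ordered_at": "2025-01-05T03:14:00Z"},
    {"order_id": "A-1", "customer_id": None,  "order_total": 10.00, "ordered_at": "2025-01-05T03:20:00Z"},
]
for problem in validate_against_contract(sample, customer_orders_contract):
    print(problem)
```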

Not everything can be automated, though. Like any “documentation,” using data contracts requires discipline. They require ongoing maintenance. It’s the responsibility of the data producer to ensure their accuracy. A disconnected, abandoned, or neglected data contract is no better than the documentation that we haven’t been producing for decades. Worse, really, since the use of that obsolete information will inevitably lead to application errors, AI hallucinations, and faulty decisions, and will ultimately undermine trust in the data. Again, when trust is lost, it’s extraordinarily difficult to get back.

Today, data contracts are most often used to support data products and the data mesh and data fabric architectures built from them.

A data contract can be used as the template for all of the details that need to be collected, defined, specified, and aligned for data products to be used in production.

Another common use today is to support and automate data quality and compliance. The information stored in the data contracts can drive data quality tools, and completeness checks can be incorporated into the CI/CD pipeline.
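As a sketch of how that might plug into a pipeline, a CI step could run a completeness gate driven by the same hypothetical contract and fail the build when required columns fall below a threshold. The threshold, the function names, and the exit-code convention are assumptions for illustration, not any particular tool’s behavior.

```python
import sys


def completeness_ratio(records: list[dict], column: str) -> float:
    """Fraction of rows in which the given column is populated."""
    if not records:
        return 0.0
    populated = sum(1 for r in records if r.get(column) is not None)
    return populated / len(records)


def run_completeness_gate(records: list[dict], contract: dict, threshold: float = 0.99) -> int:
    """Return an exit code: 0 if every required column meets the threshold, 1 otherwise."""
    failures = []
    for col in contract["schema"]:
        if col.get("required"):
            ratio = completeness_ratio(records, col["name"])
            if ratio < threshold:
                failures.append(f"{col['name']}: {ratio:.1%} complete (threshold {threshold:.0%})")
    for failure in failures:
        print("COMPLETENESS FAILURE:", failure)
    return 1 if failures else 0


# In a CI job, this would load the latest batch and the contract, then fail the build on violations:
# sys.exit(run_completeness_gate(latest_batch, customer_orders_contract))
```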

Unfortunately, tools and frameworks that create, publish, and validate data contracts are proprietary, in development, or vaporware. Some practitioners and early adopters have advocated for dbt as the data contract architecture. The name stands for “data build tool,” even though it’s written in lowercase. dbt is an open-source command-line tool that streamlines data transformation and modeling within data warehouses, combining SQL with a templating language. It’s been around since 2016, and version 1.0 was released in 2021.

With such a new technology, and the overwhelming attention directed toward AI, vendors have been hesitant to invest in developing GUI data contract management systems. A few tools are starting to appear, but you don’t have to buy one yet. Just gathering the information required for the data contracts is an excellent, and necessary, first step.

As distributed data and analytics architectures become more widely accepted and implemented, data contracts are likely to become a key component. They are the glue that holds these architectures together, facilitating data mesh and data fabric, and increasing data trust. The data contract formats and tools will surely mature in the near future.

In the meantime, let’s keep an eye on this one.
