Mind the Gap: The Product in Data Product Is Reliability

Read more about author Mark Cooper.

Welcome to the latest edition of Mind the Gap, a monthly column exploring practical approaches for improving data understanding and data utilization (and whatever else seems interesting enough to share). Last month, we explored analytics architecture stuck in the 1990s. This month, we’ll look at the rise of the data product.

It wasn’t so long ago that data products were the next big thing, and many of us were very hopeful. After all, data products are more than just the data. They require the information management activities that we’ve been advocating for a long time. Even better, the business was driving IT to do it!

Maybe it wasn’t data curation in the way we had imagined, but it was a huge step in the right direction.

Vendors developed data product marketplace software, or added marketplace features to existing products. Consultancies offered data product implementation services. Several companies took the paradigm and ran with it, creating their own marketplace interfaces and workflows. It was at the top of everybody’s to-do list.

Until it wasn’t. 

In the wake of the pandemic, data products seemed to drop off the radar – partly because to do data products right you have to do all the data curation stuff that most everyone has been resisting all along, and partly because data products got blown out of the water by the next, next big thing: generative AI.

But an interesting thing is now happening.

Companies are beginning to recognize that AI requires high-quality, well-understood data. 

Oftentimes this is because they’re experiencing the negative impact of low-quality or poorly understood data on their models. As a result, data products are making a comeback. I was excited to see data products introduced as a new entrant on the Innovation Trigger portion of the Gartner Hype Cycle for Data Management in 2023. After all, if somebody else can understand and certify the data for me, then I can use it confidently. This demand is the engine that propels data curation efforts. 

The data product concept has been fleshed out in recent years with definitions, reference architectures, and platforms. They consist of … actually, let’s not worry about what data products consist of. At least, not right now. That’s not the important part. Instead, let’s start where we should always start: the consumer.

(Before we continue, if you want to differentiate data products from data as a product, see here. Otherwise, don’t worry about it and carry on.)

Imagine you’re an analytical data consumer – maybe an analyst or data scientist or whatever. You have a question to answer and you need data. You want to spend your time generating insights, but too often you end up spending the overwhelming majority of your time finding, gathering, validating, and cleansing the data first. So much wasted time.

But wasn’t that what data warehouses and data marts and data lakes and data lakehouses were for? They certainly help with some of the gathering and finding, but they don’t seem to be working for the validating and cleansing. Many appear to have given up on the problem. Validation and cleansing capabilities have been incorporated into several existing analytical tools and built into their standard workflows. The evidence suggests a pervasive lack of trust in the data. And that brings us back to data products and the reason for their existence. What is the difference between a data product and a data mart, summary, or shared table? 

From the consumers’ perspective, the key differentiator of a data product is reliability.

As a purveyor of data products, you must provide reliable data.

We don’t give a second thought to doing our own data validation and profiling exercises on the data sets we are considering for our analyses. But think about how often we take reliability for granted in other areas of life. Do you ever open the box of cornflakes you just took down from the grocery store shelf to make sure that it has cornflakes in it? Of course not. That’s silly. Now, if the bags, boxes, and cans were unlabeled, you’d have to open each one to see what was in it. Eventually, you’d find the cornflakes. It’s no different with data. Your users shouldn’t have to open every bag, box, or can of data to discover what it contains.

You are the authority that vouches for the data product data so that every one of your users doesn’t have to do it themselves individually and repeatedly.

So, what do data product users expect?

They expect to be able to quickly and easily find the data that they need. This requires a well-organized and fully populated data product catalog. Data understanding is the necessary foundation for all data products, with all the usual suspects: business description and intended business use, expected content, lineage, calculations, transformations, architecture, and security, privacy, and retention requirements. The more accurate and complete this information, the faster and more confidently the users will be able to find the data that they need.
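For readers who think better in code, the "usual suspects" above can be sketched as a simple catalog record with a completeness check. This is only an illustration: the `CatalogEntry` class, its field names, and the `missing_fields` and `completeness` helpers are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CatalogEntry:
    """Hypothetical catalog record; the fields mirror the list in the text."""
    name: str
    business_description: str = ""
    intended_business_use: str = ""
    expected_content: str = ""
    lineage: str = ""
    calculations: str = ""
    transformations: str = ""
    architecture: str = ""
    security_requirements: str = ""
    privacy_requirements: str = ""
    retention_requirements: str = ""


def missing_fields(entry: CatalogEntry) -> List[str]:
    """Return the names of the fields that are still blank."""
    return [
        f.name
        for f in fields(entry)
        if not str(getattr(entry, f.name)).strip()
    ]


def completeness(entry: CatalogEntry) -> float:
    """Fraction of fields that have been filled in (0.0 to 1.0)."""
    total = len(fields(entry))
    return (total - len(missing_fields(entry))) / total


# Example: an entry that is only partly documented.
entry = CatalogEntry(
    name="monthly_sales",
    business_description="Net sales by store and month",
    lineage="Point-of-sale system, then warehouse, then this product",
)
print(missing_fields(entry))
print(round(completeness(entry), 2))
```

A check like this could gate publication to the marketplace, so that a data product with blank lineage or retention requirements never reaches the shelf in the first place.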

They expect the data to be accessible. Think about your data product marketplace from the users’ perspective. Talk with the data product users. You may discover that they do not approach data product consumption the same way that you would. Conduct focus groups. Make it easy for them to find what they’re looking for, to compare candidate data products (like comparing television models), and to access that data through their preferred analytics tools.

They expect the data to be correct. Your assurance that its contents are always correct is the most significant distinguishing characteristic of a data product. You provide the ongoing validation, certification, and research so that your users don’t have to. You ensure that the data product is kept current with newly arriving data. You continuously monitor its data quality. In addition to content, you must also be concerned with semantics. Changes in the business as implemented in the source systems and propagated through the data may necessitate changes to the data product.
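As a rough sketch of what "ongoing validation" might look like in its simplest form, the function below checks a delivery for missing required values and for staleness. The function name `check_product`, the `load_date` column, and the thresholds are all assumptions made for illustration; real monitoring would cover far more, including the semantic changes mentioned above.

```python
from datetime import date, timedelta
from typing import Any, Dict, List


def check_product(
    rows: List[Dict[str, Any]],
    required: List[str],
    max_age_days: int,
    today: date,
) -> List[str]:
    """Return a list of readable problems; an empty list means the checks passed."""
    if not rows:
        return ["no rows delivered"]

    problems = []

    # Completeness: every required column must have a value in every row.
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")

    # Freshness: the newest load_date (a hypothetical column) must be recent.
    dates = [r["load_date"] for r in rows if r.get("load_date") is not None]
    newest = max(dates, default=None)
    if newest is None or (today - newest) > timedelta(days=max_age_days):
        problems.append(f"stale: newest load_date is {newest}")

    return problems


# Example: one row lacks a sales figure and the load is five days old.
rows = [
    {"store": "A", "net_sales": 100.0, "load_date": date(2024, 1, 10)},
    {"store": "B", "net_sales": None, "load_date": date(2024, 1, 10)},
]
for problem in check_product(
    rows, ["store", "net_sales", "load_date"], 1, date(2024, 1, 15)
):
    print(problem)
```

The point is not the code itself but who runs it: the data product owner runs checks like these once, on every load, so that each consumer does not have to repeat them.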

There’s a lot involved in creating and curating data products, as well as in deploying a data product marketplace. But that’s the level of service we should want to provide to our users. It’s the level of service that will accelerate artificial intelligence, machine learning, and advanced analytics delivery, and improve the quality of our models. It’s the level of service that will create a competitive differentiator for our company. It’s what our users expect from us. And it’s not a technical challenge. 

Data product development is first and foremost a mindset requiring culture and discipline.

Technology can facilitate, but technology alone is not remotely sufficient. I’ve seen the data product label slapped on data marts, summary tables, and even raw data with none of the curation or monitoring. I wonder how many of us are gaslighting our users by claiming that our data products are reliable when we don’t even know what the data is supposed to contain.

We’ll talk about reference architectures, platforms, implementation, and deployment another time, but none of those will be successful without a culture that values data understanding and the discipline to fully incorporate it into the standard development processes. If you’re interested in AI (and most everyone is), and you want to use data products to train your models (because accurate models require accurate data), then this is where you have to start.