
We are at the threshold of the most significant changes in information management, data governance, and analytics since the inventions of the relational database and SQL.
Most advances over the past 30 years have been the result of Moore’s Law: faster processing, denser storage, and greater bandwidth. At the core, though, little has changed. The basic analytics architecture remains the same as it was in 1992: source systems move data into a centralized repository (or set of repositories) that provides data to downstream data marts and consumers. It doesn’t matter whether it’s a single enterprise data warehouse in the data center or a multi-technology ecosystem in the cloud, batch or streaming. It looks the same.
Recent advances in artificial intelligence are driving real information management change.
Generative AI for data management entered the Gartner Hype Cycle for Data Management in 2023. The following year, it had moved up slightly but was still the first item on the Innovation Trigger. The anticipated time to Plateau was given as five to ten years, but I don’t think it’ll take that long.
In this article, I’ll touch briefly on a couple of areas where the impact of AI on information management is already being seen, or where I expect to see it shortly. I’ll also discuss one important ripple effect: the democratization of information management functions.
Data Quality
This one is everywhere. Companies are discovering that poor data quality, and the poor data governance that enables its use, results in underperforming AI models. I illustrated the effect of data quality on AI model accuracy in an earlier blog post.
The recognition of the need for high-quality data to train AI models is largely driving the resurgence of interest in data quality and data governance.
Perhaps leadership didn’t know to ask the question, or simply assumed that their company’s data was clean – or at least clean enough to use for this shiny new AI stuff. After all, the company runs on that data. Product is moving and money is flowing. Perhaps leadership suspected that the data had problems but didn’t want to know about it. Plausible deniability. Again, the company is running fine. Don’t rock the boat. The development teams are busy enough already. But whether the ignorance was accidental or intentional, the spotlight is now on the data. Expectations of data correctness are greater today than ever before, and will continue to increase.
Data quality analysis requires the understanding of expected data content and the observation of actual data content. It’s only a matter of time before AI is applied to both ends of the data quality equation, but I’m not sure it’s totally necessary. At least not directly. That’s ironic, because AI is driving the overwhelming majority of the current interest in data quality, yet data quality scoring, pattern identification, and anomaly detection don’t necessarily require it. Just look at what’s there: Sum and Group By, basic statistics. You could assign the task to a summer intern. Start now if you haven’t already.
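To show just how basic that level of analysis is, here is a minimal profiling sketch in Python with pandas: null rates, distinct counts, and a simple standard-deviation check. The file and column names are hypothetical placeholders, not a prescription.

```python
# A minimal data-profiling sketch: null rates, distinct counts, and a
# basic standard-deviation outlier check. This is the "Sum and Group By"
# level of analysis; file and column names are hypothetical.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column quality metrics."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "null_rate": s.isna().mean(),
            "distinct": s.nunique(),
        })
    return pd.DataFrame(rows)

def flag_outliers(df: pd.DataFrame, col: str, z: float = 3.0) -> pd.DataFrame:
    """Flag numeric values more than z standard deviations from the mean."""
    s = df[col]
    scores = (s - s.mean()) / s.std()
    return df[scores.abs() > z]

orders = pd.read_csv("orders.csv")           # hypothetical extract
print(profile(orders))
print(flag_outliers(orders, "order_total"))  # hypothetical column
```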
AI could be applied to cleansing, or at least recommending data content quality improvements, but the data owners will surely want to review any changes before they’re made.
Metadata Collection
Everybody knows they need to do it. Nobody likes doing it. So, nobody does it. Or at least comparatively few do. As a result, we have an epidemic of business decisions resting on data whose meaning and expected content nobody knows. It’s the primary barrier to making your company’s data and analytics practice a true competitive differentiator. It’s the primary difference between the 80% of AI projects that underperform and the 20% that succeed.
The Holy Grail of metadata collection is extracting meaning from program code: data structures and entities, data elements, functionality, and lineage.
For me, this is one of the most potentially interesting and impactful applications of AI to information management. I’ve tried it, and it works. I loaded an old C program that had no comments but reasonably descriptive variable names into ChatGPT, and it figured out what the program was doing, explained the purpose of each function, and described each variable.
Eventually this capability will be used alongside the other code analysis tools that development teams already run as part of the CI/CD pipeline. Run one set of tools to look for code defects. Run another to extract and curate metadata. Someone will still have to review the results, but this gets us a long way there.
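To make that concrete, here is a rough sketch of what such a pipeline step might look like, using the OpenAI Python client. The model name, prompt wording, and file name are assumptions; any sufficiently capable LLM would serve.

```python
# Sketch of a CI/CD metadata-extraction step: send source code to an LLM
# and ask for structured descriptions of its data elements. The model
# name and prompt wording are assumptions, not a prescribed recipe.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Describe this program's purpose, each function, and each variable. "
    "Return JSON with keys: program_description, functions, variables."
)

def extract_metadata(source_file: str) -> str:
    code = Path(source_file).read_text()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: substitute your model of choice
        messages=[
            {"role": "system", "content": "You are a code documentation assistant."},
            {"role": "user", "content": f"{PROMPT}\n\n{code}"},
        ],
    )
    return response.choices[0].message.content

# The output still needs human review before it lands in the data catalog.
print(extract_metadata("legacy_billing.c"))  # hypothetical file
```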
Another possibility is to analyze the running application to determine expected content. “That’s cheating!” you say. “You’re just looking at the application data and saying that’s the expected content.” Yes, that would be cheating. The idea, though, is to derive meaning from context. Is the data content expected or unexpected within its context? Again, someone will still have to review the results, but compared to doing nothing …
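One crude way to sketch “expected within its context”: treat the values that dominate a column in the running application as its de facto domain, and surface the rare ones for human review. The file and column names below are hypothetical.

```python
# A crude sketch of "expected vs. unexpected within context": values that
# occur frequently in the running application's data are treated as the
# expected domain; rare values are surfaced for human review.
import pandas as pd

def unexpected_values(df: pd.DataFrame, col: str, min_share: float = 0.01) -> pd.Series:
    """Return values of `col` whose share of rows falls below min_share."""
    shares = df[col].value_counts(normalize=True)
    return shares[shares < min_share]

claims = pd.read_parquet("claims.parquet")       # hypothetical extract
print(unexpected_values(claims, "status_code"))  # hypothetical column
```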
Data Modeling
Nobody at your company is more passionate about understanding the data than your data modelers. Unfortunately, too often their work products, while admired by other data modelers, are largely ignored by everyone else. But understanding the data entities and the relationships between them is part of understanding the data. Those relationships are the threads that make up the data fabric.
In many organizations, these folks are considered a luxury item and are often jettisoned or reassigned when budgets get tight. It shouldn’t be this way, and it doesn’t have to be. Resources, both old and new, can be leveraged to increase the efficiency of your existing modelers.
Nobody should have to develop a data model from scratch.
Don’t start over. Leverage resources that you already have at your disposal.
Your company almost certainly has a library of models lying around from various past initiatives, some seen through to the finish and others abandoned partway. Start there. Company- or organization-specific business knowledge will already have been integrated into them. No need to plow the same ground again.
Industry-focused models have been around for decades. Mature models for finance, transportation, telecommunications, retail, and many others can be found online or purchased. They have been developed in conjunction with a cross-section of companies within that industry, and represent something of a least common denominator, trying to be as broadly applicable as possible. They are almost always very well documented, making the necessary customization easier.
Large language models can already ingest information about the company and/or industry and spit out a data model. I asked ChatGPT to generate a logical data model for a passenger airline reservation system. In about 10 seconds it gave me a nicely formatted and documented set of entities, attributes, and relationships. It was mostly right. Mostly.
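If you ask the LLM to return the model as structured JSON rather than prose, the response can be turned directly into reviewable artifacts. A sketch, assuming a hypothetical JSON shape that you would establish in the prompt:

```python
# Sketch: turn an LLM-generated logical model (as JSON) into skeleton DDL
# for the modelers to review. The JSON shape is a hypothetical convention
# you'd establish in your prompt, not something the LLM guarantees.
import json

model_json = """
{
  "entities": [
    {"name": "Passenger",
     "attributes": [{"name": "passenger_id", "type": "INTEGER", "pk": true},
                    {"name": "full_name",    "type": "VARCHAR(100)"}]},
    {"name": "Reservation",
     "attributes": [{"name": "reservation_id", "type": "INTEGER", "pk": true},
                    {"name": "passenger_id",   "type": "INTEGER"}]}
  ]
}
"""

def to_ddl(model: dict) -> str:
    statements = []
    for entity in model["entities"]:
        cols = []
        for a in entity["attributes"]:
            pk = " PRIMARY KEY" if a.get("pk") else ""
            cols.append(f"    {a['name']} {a['type']}{pk}")
        statements.append(
            f"CREATE TABLE {entity['name']} (\n" + ",\n".join(cols) + "\n);"
        )
    return "\n\n".join(statements)

print(to_ddl(json.loads(model_json)))
```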
None of these resources, not even AI, will get you all the way there. Eighty percent of the way there, maybe, but not all the way. The deficiencies are apparent if you know the business and you know what you’re looking for.
Company-specific and domain-specific knowledge and context are still needed.
John Ladley and I talked about this with Laura Madsen in the Rock Bottom Data Feed podcast episode, The Fuss About Data Governance Disruption. Company- and domain-specific knowledge is the “secret sauce” that differentiates organizations. Instead of having a team of less-experienced modelers with a senior modeler who reviews their work, the large language model becomes the team. Business and data professionals can focus instead on the details and idiosyncrasies of their organization and their business that they uniquely possess.
Analytics
The quality of natural language understanding has been increasing at a fairly consistent rate for many years. Recently, large language models have produced incredible improvements.
Large language models can be applied to analytics in a couple of different ways. The first is to generate the answer solely from the LLM. Start by ingesting your corporate information into the LLM as context. Then, ask it a question directly and it will generate an answer. Hopefully the correct answer. But would you trust it? Associative memories are not the most reliable for database-style lookups. Imagine ingesting all of the company’s transactions and then asking for the total net revenue for a particular customer. Why would you do that? Just use a database. I have discussed this scenario before.
The other is for the large language model to generate a SQL query that retrieves the answer from a database or other repository. Here, we begin by ingesting the database structure and metadata. The LLM could be asked the same question, but in this case it generates the SQL query that interrogates the database. Maybe it’ll even run the query for you. The critical difference is that the data from which the results are produced reside in a database (or other repository), not in an associative memory. Of course, it’s also important to see the SQL statement itself so you can confirm the correctness of the LLM-generated query.
In this scenario, the LLM is a translator and interpreter, discerning what you’re asking from your prompt.
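Here is a rough sketch of that schema-aware approach, again with the OpenAI Python client. The schema, model name, and prompt are illustrative assumptions.

```python
# Sketch of the schema-aware approach: give the LLM the database structure
# as context and ask it to write SQL, not to answer from memory. The
# schema, model name, and prompt wording are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
CREATE TABLE customers (customer_id INT, name VARCHAR(100));
CREATE TABLE orders (order_id INT, customer_id INT, net_revenue DECIMAL(12,2));
"""

def question_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption
        messages=[
            {"role": "system",
             "content": "Write a single SQL query for this schema. "
                        "Return only SQL.\n" + SCHEMA},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

sql = question_to_sql("What is the total net revenue for customer 42?")
print(sql)  # review the statement before running it against the database
```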
This has long been my vision for analytics interfaces. More than 20 years ago, I proposed to friends a data warehouse interface that was basically a Google search box.
I recently ran this experiment, too, ingesting a database schema into ChatGPT and asking it a question. It was able to handle straightforward queries easily, but as the requests got increasingly complicated, the resulting queries got increasingly incorrect.
Just as AI can only get your logical data models eighty percent of the way, it can only get your SQL queries that far as well. You still have to understand SQL to confirm and troubleshoot. You still need an understanding of analytical functions and AI algorithms: how to use them, when to use them, what the results mean, and how they can be misused.
The combination of natural language query and automatic code generation can also accelerate ETL development and data fabric implementation. I’ve tried this one, too, with similar results. The LLM takes you most of the way, but you still have to validate the application to carry it across the finish line.
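One lightweight validation guardrail, sketched here with SQLite from Python’s standard library: ask the database to plan a generated statement before ever executing it, so syntax errors and references to nonexistent tables surface immediately. The database file and query are hypothetical.

```python
# A lightweight guardrail for LLM-generated SQL: ask the database to plan
# the statement before executing it, so syntax errors and phantom tables
# surface immediately. Sketched with SQLite; the same idea applies to
# EXPLAIN in most databases.
import sqlite3

def dry_run(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement parses and plans without executing."""
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)
        return True
    except sqlite3.Error as err:
        print(f"Rejected generated SQL: {err}")
        return False

conn = sqlite3.connect("warehouse.db")  # hypothetical database
candidate = "SELECT SUM(net_revenue) FROM orders WHERE customer_id = 42"
if dry_run(conn, candidate):
    print(conn.execute(candidate).fetchall())
```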
Democratization
In the beginning, reporting and analytics required arcane data repository and mainframe programming expertise. The few employees with those skills were consolidated into an MIS department that received data requests, developed applications, produced results, and returned reports. In the 1990s and 2000s, the data warehouse democratized corporate information access by making data available in a central repository, accessible through SQL queries and tools that helped construct those queries. SQL and BusinessObjects were much easier to learn than COBOL.
Over time, as a technology matures, more and more people have access to its benefits and the barrier to entry is lowered.
That continues today. Many data and analytics activities that previously required specialized training, experience, and expertise have now been democratized. Data repositories and tools continue to become more intuitive, and more people can extract value from corporate information resources.
Remember data science unicorns? Those rare individuals who were simultaneously Ph.D. statisticians, domain experts, skilled communicators, and ninja application developers. About a decade ago it seemed that every company was looking for them. It seemed that every college was establishing a data science concentration, certificate, or degree program. When it became apparent that very few such people actually exist, most companies moved toward data science teams that have those skills in aggregate. Now, AI is democratizing data science even further.
Unicorns are no longer required, and are being replaced by those with business knowledge and an understanding of the data.
As the level of user sophistication decreases, users become more likely to misinterpret or misuse data, especially data that is not well understood. More hand-holding is needed, too. A baseline level of business knowledge and resource-utilization proficiency is required, but that is only a start.
What happens when complexity or novelty increases? What about when troubleshooting or fine-tuning is required? You need more skill than baseline. Oftentimes much more.
Anyone can take pictures, shoot videos, and record audio with their smartphone. But do you color correct and color grade your videos? Do you equalize and normalize your audio recordings? Maybe there’s somebody who does all of their network television audio and video production on their phone, but the difference between amateur and professional is usually obvious.
The point is that democratization doesn’t just mean eliminating jobs. The people will still be necessary. Instead, it’s about evolving roles: people understanding the data and the business, then automating as much of the implementation as possible.
The people and the technology have complementary strengths and should be aligned to complementary roles.
Your experienced employees know your company and your business. When enhanced with AI, not replaced by it, the combination will maximize value for your organization.