A Word About My Perspective
I am not a newcomer to master data management or data matching. I worked at Initiate Systems, which was a pioneer and market leader in reference-style MDM probabilistic matching, survivorship, and identity resolution at enterprise scale. Initiate was eventually acquired by IBM and became the foundation of IBM’s MDM product line.
When I describe my experience with this project, I am describing it from the perspective of someone who knows what good matching, survivorship, and stewardship look like and how much effort they traditionally require to build well.
The Problem: 49,000 Items, 4 Million Records, and an Excel Spreadsheet
The setting is a large commercial distribution company that grows by acquisition. Over 60 companies have been acquired in the past five years, each bringing its own item master with its own vendor part numbers, item descriptions, units of measure, and data quality problems. When a new company is acquired, its item master must be matched against the existing mastered data to determine which items already exist and which are new.
The existing tool for this job was an Excel workbook called AutoMatch. It ran six passes of exact VLOOKUP matching against a flat extract. No fuzzy logic. No algorithmic matching. No cluster resolution. No confidence scoring. When a vendor part number existed on 20 rows in the extract, AutoMatch returned whichever row VLOOKUP happened to find first. When descriptions varied across branches – “4lb KRAFT GROCERY BAG 18404” versus “GROCERY BAG KRAFT 4LB 500/BD” for the same physical product – AutoMatch saw two different items.
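The failure mode is easy to demonstrate. A minimal sketch (plain Python, hypothetical rows) of VLOOKUP-style first-match lookup shows why the answer depends on row order rather than on which cluster is actually correct:

```python
# Hypothetical extract rows: the same vendor part number appears on
# multiple rows that belong to different clusters.
extract = [
    {"vpn": "18404", "cluster": "C-101", "desc": "4lb KRAFT GROCERY BAG 18404"},
    {"vpn": "18404", "cluster": "C-347", "desc": "GROCERY BAG KRAFT 4LB 500/BD"},
]

def vlookup_first(vpn, rows):
    """VLOOKUP-style behavior: return the first row whose key matches,
    ignoring every other row with the same key."""
    return next((r for r in rows if r["vpn"] == vpn), None)

# Which cluster comes back is determined entirely by sort order of the
# extract, not by any survivorship or confidence logic.
```

Re-sorting the extract flips the answer, which is exactly the "whichever row VLOOKUP happened to find first" behavior described above.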
The result was predictable: Match rates around 70% on a good day, no way to determine which cluster a matched item belonged to, and extensive manual review that consumed weeks of a steward’s time per acquisition. AutoMatch needed to be replaced. The question was how, and how fast.
Solution: A Partnership, Not a Prompt
I want to be precise about this because the details matter. What followed was not a case of typing “build me a matching engine” into a chatbot and receiving a finished product. It was an iterative, exploratory, working partnership between an AI assistant and me, conducted over multiple extended sessions in a conversational interface.
I brought domain expertise: knowledge of how MDM clusters work, what survivorship rules mean in an acquisition context, how the vendor part numbers behave across acquired systems, what data quality traps to expect in files that have passed through Excel, and what the downstream business process needs from the output. AI brought speed: the ability to write production PySpark code on the fly, to propose matching strategies and immediately implement them, to spot patterns in data quality issues and suggest cleaning approaches, and to refactor code when the exploratory work got messy.
The process was genuinely collaborative. I would examine the results at each stage, challenge the AI’s suggestions, redirect the approach when my domain knowledge indicated a better path, and make judgment calls that required business context the AI assistant lacked. The AI would generate code, explain tradeoffs, raise concerns about data quality issues it spotted, and produce clean rewrites when the iterative exploration created messy notebooks.
A Few Concrete Examples Illustrate How This Worked in Practice
Early in the process, the AI proposed matching source vendor part numbers against the MDM’s item_code field. This produced matches, but I recognized that matching against the MDM’s cleaned VPN field would be stronger because it represented the standardized version of the vendor part number. The AI had no way to know this. It required understanding how the MDM system had been designed and what each field represented. I redirected, and the match rate jumped immediately.
Later, examining unmatched records together, the AI noticed that vendor names in the acquisition file carried location suffixes like “(CT)” and “(MR)”, which were killing the fuzzy vendor similarity scores. I confirmed this was a systematic pattern in the source data and directed a cleaning pass. The AI then spotted that some VPN fields contained multiple values separated by “or” and that vendor names contained multiple entities separated by slashes. Both required splitting and exploding before matching. These were data quality traps that neither of us would have caught as quickly alone. The AI assistant saw the patterns in the data; I validated whether those patterns were meaningful.
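A sketch of what those two cleaning passes look like. This is illustrative plain Python, not the project’s PySpark code; the function names and exact suffix pattern are assumptions:

```python
import re

# Assumed pattern: trailing two-letter location codes in parentheses,
# e.g. "(CT)" or "(MR)", appended to vendor names in the source file.
LOCATION_SUFFIX = re.compile(r"\s*\([A-Z]{2}\)\s*$")

def clean_vendor_name(name):
    """Strip a trailing location code, then split multi-entity names on '/'.
    Returns a list so each entity can be exploded to its own row."""
    name = LOCATION_SUFFIX.sub("", name.strip())
    return [part.strip() for part in name.split("/") if part.strip()]

def split_vpn(vpn):
    """Split VPN fields that carry multiple values joined by ' or '."""
    return [v.strip() for v in re.split(r"\s+or\s+", vpn, flags=re.IGNORECASE) if v.strip()]
```

In PySpark the same splits would typically feed `explode`, so each vendor entity and each candidate VPN becomes its own row before the match passes run.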
When the AI proposed description-based fuzzy matching as a third pass, I asked to see what the unmatched descriptions actually looked like compared to their potential matches in the MDM. The results showed descriptions in different languages and completely unrelated products from the same vendor. I made the call: Pure description matching would produce too many false positives. Instead, we developed a vendor-scoped description matching strategy with token-based comparison and dimension conflict detection – a more nuanced approach that neither of us would have designed as efficiently without the other.
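To make the idea concrete, here is a minimal sketch of token-based description comparison with a dimension-conflict veto. The unit list, tokenization rule, and scoring are assumptions for illustration; in the real engine this comparison runs only within a vendor scope established by the crosswalk:

```python
import re

# Assumed set of size/measure units; a real pass would carry a fuller list.
DIM_TOKEN = re.compile(r"^\d+(?:\.\d+)?(?:LB|OZ|IN|FT|GAL|CT)$")

def tokens(desc):
    """Uppercase and split a description into alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9.]+", desc.upper()))

def dims(toks):
    """Pick out dimension tokens like '4LB' or '12OZ'."""
    return {t for t in toks if DIM_TOKEN.match(t)}

def description_score(a, b):
    """Jaccard similarity over tokens, with conflicting dimensions
    vetoing the match outright (a 4LB bag is never an 8LB bag)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    if dims(ta) and dims(tb) and dims(ta) != dims(tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

The veto is the important part: token overlap alone would happily score “4LB BAG” against “8LB BAG”, which is exactly the kind of false positive the pure-description pass would have produced.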
This is what partnership looks like. Not one side driving and the other executing. Both contribute different capabilities to a shared problem.
What It Produced
The end result was an eight-pass matching engine running in Microsoft Fabric, with full survivorship logic, a stewardship review workflow, and a finalization process that produces definitive disposition for every acquired item. The engine was validated against a real acquisition file of 49,199 items matched against over four million mastered records.
The numbers tell the story. AutoMatch achieved approximately a 70% match rate with no cluster resolution and no confidence scoring. The new process achieved a 99.8% match rate with correct cluster resolution, tiered confidence scoring, and automated stewardship routing. On a validated subset of 11,219 items processed by both systems, AutoMatch matched 7,758 items; the new engine matched 11,202. Zero regressions: every item AutoMatch found, the new engine also found. Among items matched by both, 17.7% resolved to different clusters, and investigation confirmed the new engine was selecting the correct cluster through its survivorship ranking while AutoMatch was returning effectively random results.
The engine includes five data quality guards; vendor name cleaning with corporate suffix removal and location code stripping; VPN cleaning with dash removal and leading zero normalization; multi-value field explosion; and a vendor crosswalk for description-based matching. The survivorship process looks up all records in a matched cluster, applies tiered selection rules to find the best operational record, and surfaces both golden record attributes and operational fields in the output. The stewardship routing auto-accepts high-confidence matches and queues lower-confidence ones for human review with side-by-side comparison.
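Two of those pieces are easy to sketch: VPN normalization (dash removal plus leading-zero stripping, so variants collide on join) and a tiered survivorship pick. The field names and tie-break order here are illustrative assumptions, not the project’s actual rules:

```python
def normalize_vpn(vpn):
    """Remove dashes and spaces, uppercase, and strip leading zeros so
    '0018-404', '18404', and '018 404' all land on the same join key."""
    v = vpn.replace("-", "").replace(" ", "").upper()
    return v.lstrip("0") or "0"  # keep all-zero VPNs from collapsing to ""

def pick_survivor(cluster_records):
    """Tiered selection over a matched cluster: prefer active records,
    then the most recently updated one (assumed tie-break order)."""
    return max(
        cluster_records,
        key=lambda r: (r.get("is_active", False), r.get("last_updated", "")),
    )
```

Because the tiers are encoded as a sort key, adding a new survivorship rule is just another element in the tuple, which is what makes this kind of engine cheap to rerun and re-tune per acquisition.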
Having built matching engines, I can tell you: This is not a prototype. This is production tooling, documented, tested against real data, and designed to be rerun for each new acquisition.
The Effort Question
Data leaders think in terms of effort, timelines, and headcount. So, I will address the question directly: What would this have taken without the AI partnership?
Building an eight-pass fuzzy matching engine with the data quality guards, cleaning logic, survivorship rules, and stewardship workflow described above would traditionally require a team of two to three people – a senior data engineer who understands PySpark and the platform, a data quality analyst or MDM specialist who understands matching algorithms and survivorship logic, and ideally someone with experience in record linkage and entity resolution. The calendar time would be eight to 12 weeks for a production-quality implementation, including iterative testing against real data, threshold tuning, false positive analysis, and documentation.
Working in partnership with the AI, I produced a comparable result over a compressed timeframe. The matching logic itself was built, tested, and refined iteratively across several working sessions. The survivorship rules, stewardship workflow, and finalization process followed. Thorough documentation was produced concurrently because the AI could generate it from the working context without me having to write it from scratch after the fact.
I estimate the traditional effort at roughly 700 to 900 person-hours for the three-person team I described.
The AI-partnered effort took 26 hours. It was a fraction of the time because AI eliminated the mechanical bottleneck: writing code, refactoring messy notebooks, generating documentation, and implementing each iteration fast enough that I could stay in the flow of problem-solving rather than context-switching between thinking and typing.
What This Is Not
This is clearly not AI running data management. AI suggested, but did not decide what matching strategy to use. It did not determine the survivorship rules. It did not know that the MDM’s cleaned VPN field was a better match target than the raw item code. It did not understand why description-only matching would produce false positives in a multilingual master. It did not make the judgment call to reject a matching pass that looked good on paper, but would not survive contact with real data. All of those decisions required someone who understood the domain, the systems, and the business context.
This is also not a case of AI producing something a competent team could not have built. Every technique used – Levenshtein distance, token-based comparison, vendor crosswalks, tiered confidence scoring – is well-established in the record linkage literature. A good data engineering team would arrive at a similar design. The difference is speed. The partnership compressed what would have been months of work into a dramatically shorter timeline, and it allowed me to operate at the capacity of a small team.
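For readers who have not worked with these techniques, the fundamental one is small enough to show whole. This is the classic dynamic-programming edit distance, exactly as it appears in the record linkage literature (PySpark also ships it as `pyspark.sql.functions.levenshtein`):

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute (free if equal)
            ))
        prev = curr
    return prev[len(b)]
```

The technique is decades old; what the partnership changed was how fast it could be wired into eight passes, tuned against real data, and documented.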
The Broader Implication for Data Leaders
Here is what I think data executives should take from this experience.
The conversation in our industry has been dominated by the question of when AI will be ready to run data management autonomously. That question matters, and the answer is “sooner than you think.” But it has eclipsed a more immediately actionable question: What can AI do for your data management teams right now, today, with the tools and skills you already have?
The answer is: a great deal. And it applies across the full spectrum of data management disciplines, not just the example I have described here. Data quality profiling and rule development. Data integration and transformation logic. Metadata management and documentation. Reference data standardization. Data governance policy implementation. Migration planning and execution. Every one of these disciplines involves a combination of domain judgment and mechanical implementation. AI today is exceptional at accelerating the mechanical side while relying on humans for experience and judgment.
The strategic implication is that leaders should not wait for the fully autonomous future before investing in AI-augmented data management. The partnership model delivers value today. It does not require new platforms, new vendors, or organizational transformation. It requires giving your existing data professionals access to AI tools and the time to learn how to work with them effectively. The learning curve is real, but it is not steep, and the productivity gain is immediate.
If you have a backlog of data quality work that never gets done because the team is too thin, AI partnership is the lever. If you have matching and integration projects that keep getting deferred because the estimated effort is too high, AI partnership changes the math. If you have a single data engineer who understands the platform, but cannot build everything the business needs, AI partnership gives that person the throughput of a small team.
The Actionable Takeaway
Don’t wait for AI to be ready to run data management. Start empowering your teams to do the traditional work better and faster with AI as a partner. The traditional disciplines have not changed. The speed at which a skilled professional can execute them has.
The future of autonomous AI in data management is coming. The partnership model is how you get there. And it delivers enormous value along the way.


