Machine Learning at the Center of Quality Data Integration


Machine learning, as we know, isn’t magic, though it may seem that way sometimes. But it can be at the center of enterprise data integration and cross-domain Master Data Management efforts that may deliver millions of dollars in cost savings and other benefits.

In the case of large organizations that have gone through multiple acquisitions, for example, application and data fragmentation unfortunately is the norm. After all, the companies weren’t trying to be in sync before the acquisition, and both entities probably had their own issues with data cleansing to begin with.

So, the lack of visibility across multiple transactional systems is going to cause pain, and a lot of it. Each system is involved with a specific function that touches multiple business units, and each unit probably has its own way of managing data. Some say that trying to cleanse the data and integrate it across these silos using rules-based ETL, or Master Data Management that relies on a rules-based golden record, isn’t a particularly scalable or speedy approach.

Data fragmentation can happen due to many other reasons, too. Uncooperative data hoarders may be a source of the problem, as well as leadership changes that affect company priorities.

With machine learning, there can be a way to quickly and accurately connect and cleanse cross-business data that varies in structure and contains many duplicative entries.

That’s the case made by Data Management system vendor Tamr. The company recently released its latest version of Unify, which uses machine learning to enable enterprises to catalog their data sources, understand relationships between them, and curate information. The end goal is to be able to prepare data for analytics and applications.

“Traditional MDM will have a rules set complemented by machine learning models, but it needs bootstrapping with lots of models, and machine learning comes in at the end,” said Mark Marinelli, Head of Product at Tamr. “We build a model that takes all of the data—say, 200 attributes about a customer —and we can incorporate it all, so success or failure [of data unification] is not based on a handful of attributes or a handful of data sources. It’s machine learning first and then rules for correction.”
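To make the contrast concrete, here is a minimal Python sketch. It is not Tamr’s implementation. It compares a deterministic rule keyed on a single attribute with a score that averages similarity over every attribute two records share. The function names, the sample records, and the use of `difflib` string similarity are illustrative assumptions; a real system would presumably learn how much each attribute matters from labeled examples.

```python
from difflib import SequenceMatcher


def rule_match(a: dict, b: dict, key: str = "name") -> bool:
    """Deterministic rule: records match only if one key attribute is identical."""
    return a.get(key, "").strip().lower() == b.get(key, "").strip().lower()


def attribute_similarity(x: str, y: str) -> float:
    """String similarity in [0, 1] for a single attribute value."""
    return SequenceMatcher(None, x.strip().lower(), y.strip().lower()).ratio()


def match_score(a: dict, b: dict) -> float:
    """Average similarity over every non-empty attribute present in both records."""
    shared = [k for k in a if k in b and a[k] and b[k]]
    if not shared:
        return 0.0
    total = sum(attribute_similarity(str(a[k]), str(b[k])) for k in shared)
    return total / len(shared)


# Hypothetical records for the same customer held in two systems.
crm = {"name": "Acme Corp.", "city": "Boston", "phone": "617-555-0100"}
erp = {"name": "ACME Corporation", "city": "Boston", "phone": "617-555-0100"}

print(rule_match(crm, erp))   # the single-attribute rule misses the pair
print(match_score(crm, erp))  # the all-attribute score stays high
```

Because the score draws on every shared attribute, one inconsistent field lowers it slightly instead of deciding the outcome on its own.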

Bring in the Experts

Humans have a place in Tamr’s implementation of machine learning to help with corrections when needed. Once machine learning and advanced algorithms are applied to automatically connect data sources and determine which attributes to match against them, the system may hit a snag. The vast majority of work is done before humans enter the picture, but when necessary, human experts who know the data can weigh in.

“We figure out what a customer ID looks like and then figure other places that look like the customer ID,” he said. The match is suggested to a human expert, and if the technology got it wrong, the system is alerted and learns from the correction.
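One way to picture “what a customer ID looks like” is to reduce each value to a character pattern and compare pattern counts between columns. The Python sketch below is an assumption-laden illustration, not a description of Unify’s internals: the column names, the sample values, and the `record_feedback` store are all hypothetical.

```python
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to its character pattern: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def profile(values: list[str]) -> Counter:
    """Count how often each pattern occurs in a column."""
    return Counter(shape(v) for v in values)


def profile_overlap(p: Counter, q: Counter) -> float:
    """Share of values whose pattern appears in both columns, in [0, 1]."""
    total = max(sum(p.values()), sum(q.values()))
    if total == 0:
        return 0.0
    return sum((p & q).values()) / total


# Hypothetical data: a column already known to hold customer IDs,
# plus unlabeled columns from another source.
known_customer_ids = ["C-10442", "C-20981", "C-33107"]
candidates = {
    "acct_ref": ["C-55102", "C-60018", "C-71234"],
    "zip": ["02139", "10001", "94105"],
}

reference = profile(known_customer_ids)
suggestions = {
    col: profile_overlap(reference, profile(vals)) for col, vals in candidates.items()
}
best = max(suggestions, key=suggestions.get)

# Hypothetical feedback store: an expert confirms or rejects each suggestion,
# and the label is kept so a later training run can learn from it.
feedback = []


def record_feedback(column: str, is_customer_id: bool) -> None:
    feedback.append((column, is_customer_id))


record_feedback(best, True)
```

The expert only reviews the suggestion; the profiling and ranking happen before a person is involved, which matches the division of labor described above.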

Besides ensuring the quality of data integration, another aspect of the core technology is that it embeds what it does at the point of data entry. So if someone enters the 18th version of a customer in Salesforce, it calls out to the Tamr platform and asks whether the person would prefer to use the best master that already exists. “It’s all about the consumption and entry of data,” he said. “You can prevent bad data instead of trying to fix it all the time.”
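A point-of-entry check of this kind can be sketched in a few lines of Python. The `MASTERS` list, the `suggest_master` function, and the 0.6 threshold are hypothetical stand-ins chosen for illustration; the article does not describe how the actual Salesforce callout works.

```python
from difflib import SequenceMatcher

# Hypothetical mastered records; in practice these would come from the mastering system.
MASTERS = [
    {"id": "M-1", "name": "Acme Corporation"},
    {"id": "M-2", "name": "Globex Industries"},
]


def suggest_master(new_name: str, threshold: float = 0.6):
    """Return the closest existing master record, or None if nothing is close enough."""

    def similarity(master: dict) -> float:
        return SequenceMatcher(None, new_name.lower(), master["name"].lower()).ratio()

    best = max(MASTERS, key=similarity)
    return best if similarity(best) >= threshold else None


# A near-duplicate entry is pointed at the existing master record.
match = suggest_master("Acme Corp")

# A genuinely new customer gets no suggestion and can be created as usual.
new = suggest_master("Initech LLC")
```

Returning a suggestion rather than blocking the entry leaves the final decision with the person typing, which fits the “prevent bad data” framing without forcing a merge.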

It’s true that you can overfit data with machine learning, “but that’s where humans come in and make corrections. If one data source—ostensibly the master—is broken, it skews the model in a broken direction,” Marinelli said. “That’s the perilous aspect of machine learning and putting in really good agency is the special sauce.”  

Agile Data Mastering is the New Way

Marinelli pointed out that agile data mastering that takes a bottom-up probabilistic approach is the core function of Unify, whereas traditional MDM solutions that take top-down deterministic approaches might also provide governance, lineage, and so on. But Unify can augment a company’s existing MDM systems that perform other functions well but could use help with mastering data, he said. Tamr’s approach, he said, leads to more consistency.

Speaking of agile, that plays into how quickly Unify can be stood up: “With the machine learning approach first, we can give you something you can work with and start driving value from data,” he said. A waterfall approach can’t match that kind of agile, incremental value creation.

“We talk about DataOps when we talk about agile incremental value creation.” Automating data integration and unification through machine learning is analogous to automating testing in DevOps.

“Anyone who has done this will never go back. From now on data integration this way, whether using our tooling or others, will be the way to quickly get what is needed out of systems and redeploy data scientists to do more things.”

The approach attacks the problem of producing cleaner data faster so that organizations can build more analytical apps.

Realizing Benefits

According to a case study published by Tamr, one example of using the Unify platform to achieve cost savings is GE’s use of the product to drive hundreds of millions of dollars in savings by getting a single, clean view of its procurement data. GE had been unable to generate cross-business views of its supply base and had difficulty maintaining Data Quality because new suppliers were being added to its records every day.

Using Unify’s machine learning capabilities, GE was able to connect and cleanse data from all transactional/ERP systems across eight industrial businesses. Analysis of this data led to visibility into spending across the business and to renegotiated contracts that produced hundreds of millions of dollars in cost savings.

The latest version of Unify provides users with enhanced reporting on how clusters of related records (entities such as “Customers”) change over time and publishes results to their downstream systems like PowerBI or Tableau.

Tamr also recently released its Steward product, an issue tracker that lets users create trouble tickets about data issues without interrupting their workflow.

