Connecting the Dots: Strategies for Matching Big Data


Click to learn more about author Harald Smith.

Entity resolution, also known as data matching or identity resolution, is central to our ability to connect data about any given entity, whether that is a person, place, or thing (e.g. asset, part, product). For most of us, that process of connecting data together simply happens, maybe in a CRM or ERP system, maybe behind the scenes in a Master Data Management system where a “Golden Record” is generated and forwarded to other systems. For anyone focused on Data Governance, analytics, data science, machine learning, or other new business initiatives around Big Data, understanding strategies for connecting this data together is critical and can have a significant impact on success.

In my previous blog, I noted that entity resolution capability has been a foundation for most Data Quality tools since the field emerged as a Data Management discipline. Typically, this functionality is embedded into operational applications, data integration pipelines, and MDM processes. As we pump more data into our data lakes, or other downstream data stores, though, the problem of grouping like pieces of data about an entity together re-emerges, and that impacts our ability to get accurate, trusted information.

Why Big Data Has Made Data Matching More Complex

Why? In a word: volume. Our Master Data Management system is no longer the sole input to these initiatives. Data pours in from distinct lines of business, streaming data (including Internet of Things, or IoT, data from sensors and mobile devices), open data, third-party data, and more. Further, efforts focused on utilizing this data, such as analytics and machine learning, are often ad hoc or exploratory – formalizing an entity resolution process where there may not be payback is not where we want to invest time.

Many organizations have Data Quality tools and capabilities, including data matching, but often these are buried in operational processes and may not be designed to work with the volume and variety of data present in data lakes and Big Data platforms. Consequently, analysts and data scientists working on new initiatives fall back on the available tools that they have at hand. Many of these contain rudimentary matching functions such as join or merge routines that are based on exact character matching (for example, the exact spelling of name or address). These are not sufficient to produce appropriately matched and trusted data. Instead, they produce either:

  • Under-matched entities with higher percentages of duplicates that fail to connect interesting and useful insights (including potentially fraudulent or risky activities); or
  • Over-matched entities that inappropriately connect multiple distinct people or places together (linking good customers with those with high credit risk, for instance).
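To make the under-matching case concrete, here is a minimal sketch in Python. The records, field names, and the `normalized_key` helper are all hypothetical, and the normalization shown (lowercasing, dropping periods, collapsing whitespace) is only one simple assumption about what a fuller Data Quality tool would do.

```python
def exact_key(record):
    # Exact character matching, as a plain join or merge would do.
    return (record["name"], record["zip"])

def normalized_key(record):
    # Hypothetical light standardization: lowercase, drop periods,
    # collapse repeated whitespace, trim the postal code.
    name = " ".join(record["name"].lower().replace(".", "").split())
    return (name, record["zip"].strip())

def group_by(records, key_fn):
    groups = {}
    for record in records:
        groups.setdefault(key_fn(record), []).append(record["id"])
    return groups

# Illustrative data only.
records = [
    {"id": 1, "name": "Acme Corp.", "zip": "60601"},
    {"id": 2, "name": "ACME  Corp", "zip": "60601 "},
    {"id": 3, "name": "Zenith Ltd", "zip": "10001"},
]

exact_groups = group_by(records, exact_key)
normalized_groups = group_by(records, normalized_key)

print(len(exact_groups), "groups with exact matching")
print(len(normalized_groups), "groups after light standardization")
```

With exact matching, records 1 and 2 never meet, so the duplicate goes undetected. Loosening the key too far creates the opposite problem of over-matching, which is why the later rejection step matters.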

To ensure we are producing high-value, high-quality, trusted results, we need to re-educate everyone downstream about how to put an effective data matching strategy in place. That means you need the Data Quality tools and processes that allow you to correctly bring your data together. It also means you must help anyone working on these initiatives to become data literate about matching and entity resolution.

Getting the Right Data Together

Whether looking at traditional data or Big Data, data matching has three fundamental steps:

  1. Grouping data for evaluation
  2. Identifying a possible match
  3. Assessing criteria that may nullify or reject the possible match

These steps may need to be repeated either at an individual or group/set level, but at its most fundamental level, that is data matching. I’ll look at the first step here in this blog and address the others subsequently.

The central premise of this step is to get the right pieces of data together. If you don’t know about the data, if you don’t have the data at hand, then you can’t connect the dots to get the right information for use.

The best analogy I’ve seen for entity resolution was put forward by Jeff Jonas: assembling a puzzle, where you typically spend time grouping pieces by characteristics (e.g. edge pieces, color), test possible pairings of pieces, and reject those that don’t fit. However, when it comes to entity resolution, it’s not a single puzzle you have, but multiple puzzles (including duplicated ones), and some of the puzzles are incomplete. Other puzzles may be located in rooms that you do not have access to, and additional puzzle pieces are periodically brought into the room to look at, and they may or may not relate to what you already have.

Ideally, we want to group our data into small sets that are reasonable to evaluate. Consider that for a group of 1,000 records, there are ~1,000² (1 million) comparisons we’ll need to make to assess whether any pair of records potentially match (roughly half that if each pair is compared only once). Because the number of comparisons grows with the square of the group size, so does the time required to match data. It’s not realistic to compare everything to everything. Keeping groups small is a performance consideration, which is why unique, real identifiers, where they exist, are so valuable: they inherently keep groups small.
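The arithmetic is easy to check. The sketch below counts unordered pairs (each pair compared once) for a single group of 1,000 records, then for the same records split across blocks. The even split into 50 blocks of 20 is purely an assumption for illustration; real match keys rarely divide data this neatly.

```python
from collections import Counter

def pair_count(n):
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def blocked_pair_count(block_keys):
    """Total pairs when records are only compared within their own block."""
    sizes = Counter(block_keys)
    return sum(pair_count(size) for size in sizes.values())

# One group of 1,000 records: every record compared with every other.
all_pairs = pair_count(1000)

# Hypothetical blocking: the same 1,000 records spread evenly over 50 blocks.
keys = [i % 50 for i in range(1000)]
blocked_pairs = blocked_pair_count(keys)

print(all_pairs)      # 499500
print(blocked_pairs)  # 9500
```

In this made-up case, blocking cuts the work by a factor of more than 50, at the cost of never comparing records that land in different blocks.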

We can use data profiling to assess possible data elements to group by, looking specifically at the value distributions and potential group sizes. If a data element is unique to each record, it’s not usable for grouping. If a data element is sparse (that is, mostly empty), it might still be of use for some data, but anything without that piece of data remains unevaluated. Keep in mind that if we never group certain sets of data together, they will never be compared. So we look for data elements – the “blocks” or “match keys” that pull the most likely sets of data together – that are well-populated, make sense for grouping, and have distributions that divide the data into manageable sets for evaluation.
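A rough version of this profiling can be sketched in a few lines of Python. The `profile_key` function, the field names, and the sample records are all hypothetical; a profiling tool would report far more, but fill rate, distinct values, largest group, and singleton count are enough to spot keys that are too unique, too sparse, or too lumpy.

```python
from collections import Counter

def profile_key(records, field):
    """Summarize how well one field would work as a grouping key."""
    values = [record.get(field) for record in records]
    populated = [value for value in values if value not in (None, "")]
    sizes = Counter(populated)
    return {
        "field": field,
        "fill_rate": len(populated) / len(values) if values else 0.0,
        "distinct": len(sizes),
        "largest_group": max(sizes.values(), default=0),
        "singletons": sum(1 for count in sizes.values() if count == 1),
    }

# Illustrative data only.
records = [
    {"id": 1, "zip": "60601", "phone": "312-555-0100"},
    {"id": 2, "zip": "60601", "phone": ""},
    {"id": 3, "zip": "60601", "phone": None},
    {"id": 4, "zip": "10001", "phone": "212-555-0199"},
]

for field in ("id", "zip", "phone"):
    print(profile_key(records, field))
```

Here `id` is fully populated but every value is a singleton, so it groups nothing; `phone` is only half populated, so half the records would go unevaluated; `zip` is well-populated and actually forms a group.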

However, these data elements can be hard to find in the raw data. This is why parsing, standardization (phonetics, nicknames and so on), and data enrichment steps are often critical to apply before attempting to match data. These compensate for some of the most common data issues such as misspellings and keystroke errors, yet still allow us to bring good match groups together. As with the raw data, you can apply data profiling against the standardized data to get a view into how this added content helps maximize data comparison.

As you evaluate ways to group your data, consider applying the following techniques:

  • Parsing substrings to find usable groups (email domain, area code)
  • Standardizing names of any type (personal, street, city) to phonetics and nicknames
  • Generating acronyms for business names
  • Standardizing geolocation coordinates or dates to a broader value
  • Applying data binning techniques to highly discrete values like quantities or timestamps
  • Classifying descriptive content such as product or asset data that uses similar terminology with usable but distinct codes
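Several of these techniques can be sketched with nothing more than the Python standard library. Everything below is illustrative: the function names, the tiny nickname table, and the stop-word list are assumptions, and the phonetic function is a basic Soundex implementation standing in for whatever phonetic algorithm your tooling provides.

```python
from datetime import datetime

NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}  # hypothetical table
STOP_WORDS = {"the", "of", "and"}                                  # hypothetical list

SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def email_domain(email):
    """Parse a substring: everything after the last '@', lowercased."""
    _, sep, domain = email.strip().lower().rpartition("@")
    return domain if sep else ""

def standard_first_name(name):
    """Map a nickname to a standard form, if the table knows it."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def soundex(name):
    """Basic Soundex: first letter plus up to three consonant codes."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        if ch not in "hw":  # h and w do not separate repeated codes
            previous = code
    return (encoded + "000")[:4]

def business_acronym(name):
    """Generate an acronym from the significant words of a business name."""
    words = [w for w in name.replace("&", " ").split() if w.lower() not in STOP_WORDS]
    return "".join(word[0].upper() for word in words)

def month_bucket(timestamp):
    """Standardize an ISO timestamp to a broader value (year and month)."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m")

def bin_quantity(value, width):
    """Bin a discrete quantity into fixed-width ranges, keyed by lower bound."""
    return (value // width) * width

print(email_domain("Pat.Smith@Example.com"))              # example.com
print(soundex(standard_first_name("Bob")))                # R163
print(business_acronym("International Business Machines"))  # IBM
print(month_bucket("2019-03-14T09:26:53"))                # 2019-03
print(bin_quantity(137, 50))                              # 100
```

Each derived value is a candidate match key, so it is worth profiling it the same way as a raw data element before relying on it.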

When you assess these new groups, continue to keep an eye out for null values, blanks, and sparse data, which will typically not be grouped with anything, and for any other commonality in these records that will allow you to group and subsequently evaluate the data further.

Finding Gaps, Validating Groups

Rigorous evaluation is necessary for any process that matches entity data. Similar to the process needed to assess potential match groups, we need to identify gaps in the matching strategy. This is commonly overlooked. There are two primary conditions to look for: records that were isolated (never grouped with any other record) in every match pass you evaluated, and records that had null or blank values in any part of the match key for a given match pass. Business rules that filter or test for these conditions will help find these forgotten or untested records. Further analysis of the data is then needed to see how the records might be included in a match process.
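Both conditions can be tested with a simple rule. The sketch below is one possible shape for such a check, not a prescribed method: the `find_gaps` function, the pass names, and the fields are hypothetical, key values are assumed to be strings, and a record with a blank key part is assumed to be skipped by that pass.

```python
def find_gaps(records, match_passes):
    """Return (ids isolated in every pass, ids with blank keys per pass).

    match_passes maps a pass name to the list of fields in its match key.
    """
    never_grouped = {record["id"] for record in records}
    blank_key = {}
    for pass_name, fields in match_passes.items():
        groups = {}
        blank_key[pass_name] = []
        for record in records:
            key = tuple((record.get(field) or "").strip() for field in fields)
            if any(part == "" for part in key):
                # Assumption: a record with any blank key part is not grouped.
                blank_key[pass_name].append(record["id"])
                continue
            groups.setdefault(key, []).append(record["id"])
        for ids in groups.values():
            if len(ids) > 1:
                never_grouped -= set(ids)
    return sorted(never_grouped), blank_key

# Illustrative data only.
records = [
    {"id": 1, "zip": "60601", "name_key": "S530", "phone": "3125550100"},
    {"id": 2, "zip": "60601", "name_key": "S530", "phone": ""},
    {"id": 3, "zip": "10001", "name_key": "J520", "phone": "3125550100"},
    {"id": 4, "zip": "",      "name_key": "K400", "phone": None},
    {"id": 5, "zip": "94105", "name_key": "L000", "phone": "4155550123"},
]
passes = {"zip+name": ["zip", "name_key"], "phone": ["phone"]}

isolated, blanks = find_gaps(records, passes)
print(isolated)  # [4, 5]
print(blanks)    # {'zip+name': [4], 'phone': [2, 4]}
```

Record 4 was never evaluated at all because its keys are blank, while record 5 was evaluated but found no partner in either pass. The two cases call for different follow-up.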

Matching functionality can also be used as an evaluation and quality assurance tool to test match groups and identified relationships, determine whether data is duplicated, look for inappropriately linked records, and find unexpected data correlations. For instance, you can set up a match on a data element such as a national identifier or social security number, then test for and match on a negative condition (such as names that are different). Once when I worked with a large financial institution, we found over 200,000 records with the same social security number but distinct names through this approach. It turned out that the organization had used a specific value in the social security number field to identify bankruptcies, but all the individuals included in this group were distinct. A simple upfront business process produced unexpected downstream consequences, and could have tremendous ramifications in predictive analytics, AI, and machine learning.
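A negative-condition test like this one is straightforward to sketch. The function name, field names, and records below are hypothetical, the identifiers are obviously fake placeholders, and the name comparison is a simple case-insensitive check rather than a real name-matching routine.

```python
def same_id_different_names(records, id_field="ssn", name_field="name"):
    """Find identifier values shared by records whose names differ."""
    names_by_id = {}
    for record in records:
        identifier = record.get(id_field)
        if not identifier:
            continue  # blank identifiers are a separate gap to investigate
        name = " ".join(record[name_field].lower().split())
        names_by_id.setdefault(identifier, set()).add(name)
    return {
        identifier: sorted(names)
        for identifier, names in names_by_id.items()
        if len(names) > 1
    }

# Illustrative data only; "999-99-9999" stands in for a special flag value.
records = [
    {"ssn": "999-99-9999", "name": "Ann Lee"},
    {"ssn": "999-99-9999", "name": "Raj Patel"},
    {"ssn": "999-99-9999", "name": "Tom Ruiz"},
    {"ssn": "000-00-0001", "name": "Maria Gomez"},
    {"ssn": "000-00-0001", "name": "MARIA  GOMEZ"},
    {"ssn": "", "name": "No Identifier"},
]

suspects = same_id_different_names(records)
print(suspects)  # {'999-99-9999': ['ann lee', 'raj patel', 'tom ruiz']}
```

An identifier that collects many distinct names is a strong hint that the field is carrying something other than an identity, and that matching on it alone would over-match.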

To connect the dots, or fit the pieces of the puzzle together, we need matching strategies that allow us to maximize the opportunity to put like data together in the same processes in order to get closer to the insights we hope to find. When you look to group data, remember to profile the data elements and understand the frequency distributions and null values not only for the raw data, but also for parsed and standardized forms that may offer better, more effective grouping.
