Data can be anywhere. Companies store data in the cloud, in data warehouses, in data lakes, on old mainframes, in applications, on drives — even on paper spreadsheets. Every day we create 2.5 quintillion bytes of data, and there are no signs of this slowing down anytime soon. With so much available for data-driven decisions, you’d think every company would be relying on solid analytics to compete in the marketplace. In practice, however, one in three business leaders don’t trust the quality of the data they use to make decisions, and bad data costs the U.S. economy $3.1 trillion per year, according to Extracting Business Value from the 4 V’s of Big Data.
Rules-based Internal Data vs. Active External Data
The DAMA International Data Management Body of Knowledge defines "high quality data" as that which is "reliable and trustworthy." So how can companies improve and maintain the quality of their data? Not all data requires the same amount of effort to maintain.
Bud Walker, Vice President of Sales & Strategy for Melissa, divides data into two broad categories based on the level of effort it requires to stay current. Internal, ‘rules-based’ data changes less often and requires more internal subject matter knowledge. Active, external data is constantly changing.
Examples of rules-based, internal data:
- Employee performance data
- Supplier payment terms
- Product information
Examples of active external data:
- Customer data
- Email addresses
- Job titles
- Company names
Dimensions of Data Quality
In his white paper, The Build versus Buy Challenge, Walker breaks down Data Quality into six dimensions:
- Completeness: Are all the pertinent fields filled?
- Validity: Do all the values conform? Are street address fields in the right order and properly spelled?
- Accuracy: Does the data reflect a real-world person or object? Mickey Mouse is probably not a real sales prospect.
- Consistency: Does the data align with understood patterns? DOB, for example, is formatted MM/DD/YYYY in the U.S., but is different in international markets.
- Uniqueness: Are there duplicate instances?
- Timeliness: Is it up-to-date? Twenty-five percent of marketing data goes stale within a year due to email, phone number, and household changes.
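Several of these dimensions can be scored mechanically. Below is a minimal Python sketch, using illustrative field names and toy rules of my own choosing (not Melissa's), that measures a small record set on completeness, validity, and uniqueness. Accuracy and timeliness need external reference data, so they are deliberately left out.

```python
import re
from datetime import date

# Toy customer records; the second is an exact duplicate of the first,
# and the third has a missing email and a malformed ZIP.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "zip": "10001", "updated": date(2024, 3, 1)},
    {"name": "Ann Lee", "email": "ann@example.com", "zip": "10001", "updated": date(2024, 3, 1)},
    {"name": "Bob Ray", "email": "", "zip": "1000A", "updated": date(2019, 1, 5)},
]

def completeness(recs, fields):
    # Completeness: are all the pertinent fields filled?
    filled = sum(1 for r in recs for f in fields if r.get(f))
    return filled / (len(recs) * len(fields))

def validity(recs):
    # Validity: do ZIP values conform to the expected 5-digit pattern?
    return sum(bool(re.fullmatch(r"\d{5}", r["zip"])) for r in recs) / len(recs)

def uniqueness(recs):
    # Uniqueness: what fraction of records are distinct instances?
    return len({tuple(sorted(r.items())) for r in recs}) / len(recs)

print(completeness(records, ["name", "email", "zip"]))  # one empty email lowers the score
print(validity(records))                                # "1000A" fails the pattern
print(uniqueness(records))                              # one exact duplicate
```

A consistency check (for example, one DOB format per market) would follow the same shape: a rule per field, scored across the whole set.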
How Does Data Go Awry?
According to the DAMA DMBoK2, bad data comes with the territory:
“Because no organization has perfect business processes, perfect technical processes, or perfect data management processes, all organizations experience problems related to the quality of their data.”
The entry process is fraught with opportunities to create bad data, such as missing or extra fields, inconsistent casing, or transliteration errors. Paper records, spreadsheets, and other forms of data can be misread or arrive in a different format.
The e-commerce explosion of worldwide product delivery is shining a spotlight on the growing need for better Data Quality. Walker said:
“There’s no rules-based engine that’s going to tell you if that two-hundred-dollar product you’re shipping to somebody is going to get there or not. You need to know before you take the risk and send it.”
According to Walker, an understanding of the following basics is a start:
- What constitutes a proper address or customer record?
- What must be there?
- What should not be there?
- What should fields contain? Text? Numbers? Characters?
- Does the information from your various sources fit neatly into each field? What if it doesn’t?
- Is it correct?
- Is it current?
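The first several of these questions can be encoded as a record-level rule check. The sketch below is illustrative, with hypothetical field names and rules; checks like these can confirm a record is well-formed, but, as Walker's point above makes clear, not that it is correct or current.

```python
def check_record(rec):
    """Return a list of rule violations for one customer record."""
    problems = []
    # What must be there?
    for field in ("name", "street", "postal_code"):
        if not rec.get(field):
            problems.append(f"missing {field}")
    # What should fields contain? Text? Numbers?
    pc = rec.get("postal_code", "")
    if pc and not pc.isdigit():
        problems.append("postal_code should be numeric")
    if any(ch.isdigit() for ch in rec.get("name", "")):
        problems.append("name should not contain digits")
    return problems

print(check_record({"name": "A. Customer", "street": "1 Main St", "postal_code": "90210"}))  # []
print(check_record({"name": "Agent 99", "postal_code": "ABC"}))  # three violations
```

The numeric postal-code rule itself illustrates the consistency trap: it holds for U.S. ZIP codes but would wrongly reject, say, U.K. postcodes.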
Internal Data Quality is maintained by comparing data to a set of internal standards and cleaning or updating as necessary. External Data Quality is far more complicated. Beyond ensuring that supplier payment terms are accurate and product information is complete and properly spelled, for example, external data must be verified.
Verification answers a different set of questions:
- If the address is a valid one, does the person or company actually reside at that address?
- What language is it in?
- Does the currency match the country?
- Is the purchaser old enough to make that type of purchase, e.g., an insurance policy?
- Duplicates with variations — are they the same person? Same address?
International Verification Challenges
When Melissa began its international expansion, Walker said, the company discovered many Data Quality issues abroad. Only recently has it become possible to validate identity and addresses in rural, unverified places, and international data has many potential points for errors to occur. Language differences, data formats, address formats, letterforms, and field matching are all possible sources of bad data.
Each country’s unique data format must be matched, field-by-field, in order to ensure that data is correct. Japanese text, for example, has three distinct scripts as well as transliterations using the Roman alphabet; Cyrillic text, which is used in Eastern Europe, Central and North Asia as well as the Caucasus, is particularly challenging.
There are tools that can do field matching with string-distance algorithms, such as Levenshtein or Jaro-Winkler, but they break down on non-English and non-Latin scripts such as Cyrillic. "People think they can just transliterate it and then work on the strings, but it doesn't go back the other way," he said.
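To show what such algorithms do (and don't do), here is a self-contained Levenshtein edit distance; Jaro-Winkler is similar in spirit, with extra weighting for matching prefixes. Distance catches typos within one script, but it cannot bridge a transliteration: a Cyrillic string and its Latin rendering share no characters, so their distance is simply the longer string's length.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Main Street", "Main Stret"))  # 1: likely the same field value
print(levenshtein("Цветаева", "Tsvetaeva"))      # 9: the scripts share no characters at all
```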
Name Verification Challenges
Personal names can be a source of duplicate records. Which nicknames or bynames are common in different countries? An American customer named “Elizabeth Smith” at the same address as “Betty Smith” could be the same person, but could a Ukrainian customer named “Nyusha Tsvetaeva” be the same person as “Anna Tsvetaeva”? If her name were written, “Нюша Цвета́ева,” and not transliterated into a Latin alphabet, how likely would it be for a U.S. company to know that all three were the same person?
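One common approach to the nickname problem is a per-country byname table that maps variants to a canonical form before comparing. The sketch below is purely illustrative (the table entries are mine, not a real reference set), and it shows exactly where the Cyrillic case fails: without an entry for the non-transliterated form, the match is missed.

```python
# Toy byname table; real coverage requires per-country reference data.
NICKNAMES = {
    "betty": "elizabeth",
    "liz": "elizabeth",
    "nyusha": "anna",  # common Russian/Ukrainian byname
}

def canonical(first_name: str) -> str:
    n = first_name.lower()
    return NICKNAMES.get(n, n)

def maybe_same_person(a: str, b: str) -> bool:
    """Crude duplicate check: same surname, same canonical first name."""
    a_first, a_last = a.lower().split()
    b_first, b_last = b.lower().split()
    return a_last == b_last and canonical(a_first) == canonical(b_first)

print(maybe_same_person("Betty Smith", "Elizabeth Smith"))      # True
print(maybe_same_person("Nyusha Tsvetaeva", "Anna Tsvetaeva"))  # True
print(maybe_same_person("Нюша Цветаева", "Anna Tsvetaeva"))     # False: no Cyrillic entry
```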
Address Verification Challenges
Address verification on an international basis is a particularly difficult problem, Walker said. In India, the process of verification involved Melissa working with the national government, with Google, and with all the mapping companies in India to import, conflate, and build a dataset that includes not just major cities, but smaller cities as well.
“We have thirty people on the ground in India doing nothing but calling data,” he said. Street checking of some components of a customer’s identity profile has always been possible with Melissa’s tooling, but in the last two years, as they talked with governments, telecommunications companies, and other customer data sources, they started to realize that a full entity verification could be performed.
Not all countries have a required, government-sanctioned identification system. India, for example, has a unique 12-digit identity number called “Aadhaar” that some residents of India can obtain, but the program is voluntary, there are fees involved, and the government does not officially sanction its use for identification.
Walker said they discovered that in order to get LP gas for cooking, residents must provide a verified address and identification, as do residents purchasing a mobile phone, so Melissa has been working with the gas and phone companies in India for identity and address verification. That has made it possible to take a national ID, match it to a person, match it to a verified address, and crosslink it.
Development of Data Quality Wheel
Melissa used the process of working with verification issues in India as a springboard to develop a matching engine, a profiling engine, and a generalized cleansing engine, and to expand into areas beyond customer data.
“We developed the full wheel of Data Quality to accommodate what we were seeing. We really learned from doing.”
Because Melissa was, according to Walker, the first to tackle matching on an international scale and the only company able to match in Cyrillic, big companies doing business worldwide approached them for help solving problems with their massive data stores.
“We went through their catalog of data issues and patterned our Data Quality tools to solve those real-world problems. They had issues with addresses, with businesses, with phone numbers — all across the board,” he said.
Data Quality: It’s Not Sexy
Not all companies are able to see the value of investments in Data Quality, and Walker said that customers often have a hard time seeing the benefits of a Data Quality regimen. Kevin W. McCarthy, in Data Quality's Image Problem, likens Data Quality to the boring yard work that must be done before the backyard party can take place:
“How do you convince stakeholders that Data Quality is not an ‘IT-only’ bane of record-level tedium, but a business imperative and facilitator for a wide variety of impactful data projects throughout the organization?”
Like yard work, the right tools can make Data Quality much easier, McCarthy said, such as when he switched from a push mower to a riding lawnmower. Similarly, Data Quality tools that are built for the wider business, with machine learning and intuitive workflows, make the job much easier for a larger group of stakeholders.
“Data Quality may not get the glory, but there is no denying that it has a dramatic, positive impact on your Data Management projects,” said McCarthy.
Reference Data is the Key
Walker said he sees some people trying to reinvent the wheel by writing their own name or address parsers, but they don’t have the knowledge. “It takes forever, and then they realize six months or a year has been wasted and they haven’t made any progress.”
In a webinar series Walker did called “Build versus Buy,” he helped companies determine whether the DIY path is really the best choice. “The demarcation point is whether or not it’s reference-data driven.”
According to the DMBoK2, many people assume that reference data is simply codes and descriptions; however, much reference data is more complicated than that. For example, a ZIP code data set will usually include information on state and county as well as other geo-political attributes.
Walker uses a five-digit postal code, 10233, to illustrate. Because it’s five digits, other engines will say it fits the pattern. “But we go in and we actually evaluate the postal codes and we can tell you that 10233 is not a valid postal code, and you won’t find that if you don’t have the reference data to back it up.” There’s a very clear distinction between creating a lexicon based on a company’s own internal knowledge, Walker said, and being able to bring that external reference data in from authorized sources.
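Walker's distinction can be shown in a few lines. In the sketch below, the three-code reference set is a toy stand-in for licensed reference data; the point is only the difference between the two checks, not the codes themselves.

```python
import re

# Toy stand-in for licensed postal reference data.
ZIP_REFERENCE = {"10001", "10002", "10003"}

def fits_pattern(code: str) -> bool:
    # A rules-only engine: any five digits look like a ZIP code.
    return bool(re.fullmatch(r"\d{5}", code))

def is_assigned(code: str) -> bool:
    # A reference-data-driven engine: only codes that actually exist pass.
    return code in ZIP_REFERENCE

code = "10233"
print(fits_pattern(code))  # True: five digits, so a pattern check passes it
print(is_assigned(code))   # False: no entry in the reference data
```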