A Fable with a moral
Once upon a time, there was a company. This company had a terrible time recognizing its customers and understanding what products those customers had. The reasons for this weren't hard to understand. The company's main business was selling insurance, and each type of insurance was sold using a different computer system. That is, if you purchased an automobile policy, a homeowner's policy, a personal liability policy, and a road rescue policy, your name and pertinent information would appear in four different systems. Each of these systems required only the data necessary to sell that product, and was completely unaware of any overriding need of the enterprise to identify its customers. That is, the systems were completely product-centric. For example, while the auto policy system required a driver's license number and a birthdate for the policyholder, the homeowner policy system did not. Actually, the homeowner system allowed for a birthdate, but typically this was filled in using the birthdate of the oldest person in the household (not necessarily the policyholder) to trigger the senior discount. Each system also had its own rules about how data was stored, and what validations were done on that data. For example, the homeowner policy system would trigger an error if multiple active policies were written against a specific address. But there was no address standardization, so the check fired only when the address was an exact match. Change a single abbreviation (e.g., from "Lane" to "Ln") and the duplicate policy went through just fine. The auto policy system did require a driver's license number, but did not check to see if it was unique (or nearly so). Thus, the same driver's license number could be (and was) used for hundreds of policies, as that was faster and easier than waiting while the potential customer looked up their number, or delaying until the customer received a driver's license from the state into which they had just relocated.
Note: This last little bit of business wreaked considerable havoc while it went undiscovered. The driver's license number is used to look up moving violations for rating the policy, and the data is "enriched" by obtaining a feed from the DMV (based on driver's license number, naturally enough). Suddenly, this enriched data came back with the same driver information on hundreds of policies, as one of the "dummy" license numbers commonly used to write the policies happened to correspond to a real person. The problem was discovered as a result of the effort to uniquely identify customers ("Customer Master").
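The kind of profiling that would surface shared "dummy" license numbers can be sketched in a few lines of Python. This is only an illustration: the record layout (`policy_id`, `license_number`), the sample values, and the threshold of five policies per license are all assumptions made for the example, not details of the actual systems.

```python
from collections import Counter


def find_shared_license_numbers(policies, max_policies_per_license=5):
    """Return license numbers that appear on more policies than the threshold.

    `policies` is an iterable of dicts with a hypothetical "license_number"
    key. Values are trimmed and upper-cased so trivial variations still count
    as the same number.
    """
    counts = Counter(
        p["license_number"].strip().upper()
        for p in policies
        if p.get("license_number")
    )
    return {lic: n for lic, n in counts.items() if n > max_policies_per_license}


# Made-up sample: 200 policies written against one placeholder number,
# plus one legitimate driver with two policies.
policies = [{"policy_id": i, "license_number": "D0000000"} for i in range(200)]
policies += [
    {"policy_id": 200, "license_number": "b1234567"},
    {"policy_id": 201, "license_number": "B1234567 "},
]

suspects = find_shared_license_numbers(policies)
print(suspects)
```

A report like this does not say which policies are wrong, only which license numbers deserve a closer look before anyone trusts data enriched from them.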
Who is the customer?
The inability to understand who the customer was, what products the customer had, and what the customer needed led to considerable confusion and what the CRM manager at this company termed "a disastrous customer experience". Oh, and it wasn't much fun for the employees, either. For example, the customer might call in to provide a change of address. The person taking the call would actually have to ask the customer what other products the customer had so the rep could adjust the address in each of the systems that contained it. Additionally, an insurance agent might call a customer who had an auto policy to try to sell a personal liability or homeowner's policy. Imagine the confusion and consternation when the customer informed the agent that he or she already had such a policy with the same company! Feedback from the customer base suggested strongly that customers not only felt it was reasonable to expect the agents to have this information, but also wondered how well a company that did such a poor job of keeping track of things would perform when it came time to process a claim! This probably led to lost business, though this is one of the impacts of poor data quality that is hard to quantify.
What was NOT hard to quantify was the lost business that resulted from the inability to tell whether an insured was also a member of the organization. You see, one of the requirements for buying insurance from us was being a "member", which involved paying some additional money and receiving some additional benefits (including the ability to buy our insurance). At renewal, a check would be made to see if the person was still a member. If not, they would be contacted and asked to reenroll, and if that effort failed, the insurance could not be renewed and a customer was lost. However, very often the person DID in fact still have an active membership, but the lack of a "master customer" hid that fact. The customer knew they had an active membership, and either didn't respond to such an obvious error or decided (again) that a company that couldn't keep track of its customers didn't deserve their business. It's pretty easy to count the number of policies that lapsed due to "non-membership" and apply some factors to estimate the value of that lost business. Another troubling aspect was that, in order not to lose business, the customer might be given a free membership -- so now quite a few people had TWO. When they were then billed for both, they paid for only one (because there is no reason to have two) and allowed the other one to lapse. This tendency skewed metrics like retention and renewal rates. And just to add insult to injury, we conducted periodic campaigns to get these "lapsed" members to renew. The metrics around the success rates of these campaigns were also skewed by the fact that a significant percentage of those in the target audience already had an active membership.
Issues with Building the Master Customer List
The need to develop a complete and accurate picture of the customer led initially to an effort to develop a master customer list. This list was culled from the various systems that collect customer information, and it linked each master version of the customer to all the products the customer owned. However, the ability to accurately identify that the various versions of a customer stored in siloed systems are in fact the same person depends heavily on having accurate data. Despite a significant effort to create a robust architecture, the use of a highly rated probabilistic matching engine, and months of tuning the results, we were not able to get above about 70% automated good matches, and there were too many false matches as well. This situation caught me by surprise, as I had previously had stunning success with this sort of matching when I implemented it in the pharmacy of a large chain drugstore. There turned out to be two key differences between that effort and the current one, which explained the huge difference in results (at the chain drugstore, we got essentially 100% accurate matches and no false linkages).
The first difference was that at the pharmacy we weren't matching data from different systems. Instead, we were matching patient data from different stores, with the idea of creating a "central patient" -- a master patient with a complete drug history who could walk into any store in the chain and be recognized. Previously, if a patient went into a store they hadn't been in before, the pharmacy personnel had to take their demographic and insurance information, get a list of drugs they were taking (to check for drug-drug interactions), and so on. Since each store was running the same software, it was possible to easily map the data from the individual stores into a master database with no assumptions about the meaning of the data or how it was derived. But in the case of the insurance master customer, the data in each system had to be examined, meanings figured out, derivations examined, and then mapped together to get the "same" fields from each system so that the match could be undertaken. Unlike the pharmacy system, each insurance system had its own assumptions. For example, one system had a preponderance of birthdates on 12/31 (we profiled the data to find this out). This turned out to be a vestige of an earlier conversion, and thus we could trust only the year portion of the birthdate when doing matches if the birthdate was 12/31. Another system had a birthdate field, but it turned out (as mentioned earlier) to contain not the birthdate of the policyholder, but that of the oldest person in the household.
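The 12/31 finding lends itself to a small illustration. The sketch below, in Python, shows one way such a spike could be found by profiling, and how a comparison rule might fall back to the year alone when either date is the placeholder. The function names and sample dates are invented for the example; a real probabilistic matching engine would more likely give a year-only agreement a reduced weight than treat it as a full match.

```python
from collections import Counter
from datetime import date


def profile_month_day(birthdates):
    """Return the share of records falling on each (month, day) pair."""
    total = len(birthdates)
    if total == 0:
        return {}
    counts = Counter((d.month, d.day) for d in birthdates)
    return {month_day: n / total for month_day, n in counts.items()}


def birthdates_agree(a, b, placeholder=(12, 31)):
    """Compare two birthdates, trusting only the year when either date
    falls on the placeholder month and day."""
    if (a.month, a.day) == placeholder or (b.month, b.day) == placeholder:
        return a.year == b.year
    return a == b


# Made-up sample in which three of four records carry the 12/31 placeholder.
sample = [date(1950, 12, 31), date(1962, 12, 31), date(1971, 3, 14), date(1980, 12, 31)]
profile = profile_month_day(sample)
print(profile[(12, 31)])

print(birthdates_agree(date(1950, 12, 31), date(1950, 6, 2)))  # year-only comparison
print(birthdates_agree(date(1950, 6, 3), date(1950, 6, 2)))    # full-date comparison
```

With real data, any single day should hold roughly 1/365 of the records, so a day that holds a far larger share is worth investigating before its values are used for matching.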
It's all about the data quality
The major difference, however, turned out to be the quality of the data. The quality of the data in the pharmacy was really, really good. Names, birthdates, addresses, phone numbers, gender, and medical insurance ID were spot-on accurate. Even in central California, where there were literally thousands of individuals named Maria Garcia, we were able to match them up and get them right. And the reason that the data was really, really good (high quality)? Simple -- the business provided a powerful incentive to get it right. Unlike many businesses, where simply filling in anything is enough to close the transaction, in the pharmacy the data HAD to be right or the transaction wouldn't go through. Over 95% of the customers had drug coverage insurance. If any of the data was incorrect, the transaction would be rejected by the drug coverage company. It was a Data Quality practitioner's dream -- the data had to be perfect before the transaction could take place. Further, pharmacy personnel, who are incented partially by how many prescriptions are filled and sold, have an incentive to collect the correct data and update the system. Even the customer, who would really rather not pay cash, has an incentive to provide the correct data so they can get their medicine paid for.
Compare that situation with almost any other business. As I said earlier, most data can be collected incorrectly, or at least never updated, and the transaction goes through just fine. The customer-facing personnel are usually incented to be fast (remember the adage: "be careful what you pay for") but not accurate. The systems they use are often aging and have very little data validation built in (this was true of the pharmacy system too, but it didn't matter). And even if you change all that, the customers are often leery of providing personal information. When we discovered the issue with 12/31 birthdates, we tried to clean the data up by calling the customers to get their birthdates. In lots of cases, they refused to provide that information over the phone. Some agreed to send it by letter, but we never got most of them. Of course, in the case where the birthdate is needed to write a policy, they'll at least provide that information at renewal, as they expect to do that. But when the information was needed only so we could do customer matching, we had little luck getting the customers to provide it. For example, we don't need your birthdate for a membership (unless you're a minor), so if we suddenly start asking for it, people get suspicious.
Of course, a determined team of people can make headway, so don't get the idea we gave up in frustration. Addresses can be standardized and scrubbed. Names can be parsed (often names are combined in a single field with prefixes and suffixes), key phrases recognized and removed (such as "trustee"), and information enriched from outside sources based on addresses. Business processes can be changed to collect data even in cases where it isn't needed for that transaction -- and incentives changed to reward people for collecting the data correctly. Real-time validation of the data (driver's license number, address, etc.) can be put in place, and where overrides are allowed, the number of overrides tracked to see if anyone is abusing the privilege. Little by little we got ahead of it, with our automated match rate edging up into the mid-90s and then the high 90s (percent), and false linkages dropping off sharply.
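To make the standardizing and parsing steps concrete, here is a deliberately naive Python sketch. The abbreviation table, the prefix, suffix, and noise-word lists, and the function names are all assumptions chosen for illustration. Production address scrubbing normally relies on postal reference data and handles many cases this ignores (for example, "St" meaning "Saint", or unit numbers).

```python
import re

# Hypothetical, far-from-complete lookup tables.
STREET_ABBREVIATIONS = {
    "lane": "LN", "ln": "LN",
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "road": "RD", "rd": "RD",
}
PREFIXES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}
NOISE_WORDS = {"trustee"}


def standardize_address(address):
    """Drop punctuation, upper-case, and map street-type words to one abbreviation."""
    tokens = re.sub(r"[^\w\s]", " ", address).lower().split()
    return " ".join(STREET_ABBREVIATIONS.get(t, t.upper()) for t in tokens)


def parse_name(raw):
    """Split a single free-text name field into prefix, given, family, and suffix."""
    # Keep apostrophes and hyphens so names like O'Neil or Garcia-Lopez survive.
    tokens = re.sub(r"[^\w\s'-]", " ", raw).lower().split()
    tokens = [t for t in tokens if t not in NOISE_WORDS]
    prefix = tokens.pop(0) if tokens and tokens[0] in PREFIXES else None
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    return {
        "prefix": prefix,
        "given": " ".join(tokens[:-1]),
        "family": tokens[-1] if tokens else "",
        "suffix": suffix,
    }


print(standardize_address("12 Maple Lane"))
print(standardize_address("12 maple Ln."))
print(parse_name("Dr. Maria Garcia Jr., Trustee"))
```

Once both spellings of an address reduce to the same string, an exact-match duplicate check like the one in the homeowner system can no longer be bypassed by changing "Lane" to "Ln".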
Finally, the stuff we learned was built into the new systems we built, so that data quality was protected as much as possible at entry. That is, we "designed in" data quality. And isn't that the best way?