The famous American frontiersman Daniel Boone was once asked if he was lost. "No," he replied, "lost means you don't know where you are. I know where I am. It's how to get to where I am going that has me a mite perplexed." I thought about this definition of "lost" recently when a professional acquaintance complained that his company's data used to be of good quality, but somehow that high-quality data was lost.
Of course, in the sense that Daniel Boone used the word, it is rare that the quality of data is ever really "lost". If you think about the business processes and how the data is used, you can usually figure out what happened to the quality. There aren't really that many options for why good data goes bad:
1. It might be that nothing changed at all. This is true a surprising amount of the time. The data quality was fine for the purposes it was used for before, but it was insufficient for a new purpose. In the introduction to Danette McGilvray's seminal book on Data Quality, there is a story about a pharmacy data system where all sorts of symbols were appended to the patient's last name to indicate information such as the fact that the patient had other insurance, or that the coverage was workman's compensation, or a whole host of other information that the aging pharmacy system had no fields for. This was not a problem because the data in the last name field wasn't used for anything else. But then...the business process changed and the data was put to use for a new purpose. Specifically, the pharmacy began sending out refill reminders, using the patient's name to generate the mailing labels. After the first batch went out, there was a flood of angry calls from people wondering why their names showed up with all sorts of symbols after them. And why do I know this story so well? Well, guess who had to write the program to strip all those characters off the names before the labels were generated?
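For the curious, a cleanup program like that can be tiny. This is just a sketch of the idea, assuming a made-up rule that last names should keep only letters, apostrophes, hyphens, and internal spaces; the real pharmacy system's symbols and rules are not described in detail here:

```python
import re

def clean_last_name(raw_name: str) -> str:
    """Strip the extra symbols an aging system packed into the
    last-name field (hypothetical rule: keep letters, apostrophes,
    hyphens, and internal spaces only)."""
    cleaned = re.sub(r"[^A-Za-z' \-]", "", raw_name)
    return cleaned.strip()

# A record flagged with symbols for "other insurance", etc.
print(clean_last_name("O'Brien*#"))  # -> O'Brien
```

Of course, the safer long-term fix is adding real fields for that information, so nobody has to un-pack the name column ever again.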
2. The data might have started getting put in wrong. There are a whole host of reasons why this can happen. If the business suddenly starts putting a premium on speed of data input, speed is exactly what you are going to get. The people incented to be fast will figure out every shortcut, use defaults where possible, skip every field they can, and so on. As the old adage goes -- be careful what you measure. Another problem can be training -- if you suddenly bring on a new crew of people to do the work (for example, opening a new contact center with new employees) or start using a new application, people may not know HOW to enter quality data. This is especially true in companies where the QA on new applications is done hastily or with no attention paid to the quality of the screen layout and labeling, enforcement of business rules (did you even collect them?), and metadata rules, such as lists of valid values. Most people don't climb out of bed in the morning saying "today I'm going to put in crappy data", but you have to make it easy to do the right thing. This involves good application design and adequate training. And incenting for quality!
3. The problem may not actually be a "data quality" problem, but instead a "metadata quality" problem. The classic example is a term that "everyone knows the meaning of". A friend told me a story about a group of BI developers spending more than a week trying to figure out why numbers reported by two different groups simply didn't balance -- in fact, weren't even close. My friend even got both groups to define the term (which happened to be "transaction") and both used the same definition. The problem? One group counted transactions that completed and for which a payment was successfully received. The other group counted every attempt to complete, including multiple declines by the credit card company. This failure of the "derivation rule" caused the wide variation. Ouch!
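To see how two groups can share a definition and still disagree, here is a toy illustration with invented data and field names (the real systems in the story are unknown): both queries count "transactions", but one filters on success and the other counts every attempt:

```python
# Hypothetical transaction log; status values are illustrative.
transactions = [
    {"id": 1, "status": "completed"},
    {"id": 2, "status": "declined"},   # first swipe declined
    {"id": 2, "status": "declined"},   # second swipe declined
    {"id": 2, "status": "completed"},  # third swipe went through
    {"id": 3, "status": "declined"},
]

# Group A's derivation rule: only attempts where payment succeeded.
completed = sum(1 for t in transactions if t["status"] == "completed")

# Group B's derivation rule: every attempt, declines included.
attempts = len(transactions)

print(completed, attempts)  # 2 vs. 5 -- same word, very different numbers
```

Both groups would swear they are counting "transactions", and both would be right by their own derivation rule. That's the metadata gap.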
4. And yes, our friends in IT can occasionally bollix up the works by using the wrong source or scrambling an ETL (extract, transform, and load) job. But if changes are carefully designed, adequately tested (including by the business) and documented, this doesn't happen too often. Again, this sort of thing should be caught during the testing phase if you are actually looking at the data. A bad mapping will likely show up as nonsensical results, and incorrect implementation of business rules should show up as an error either during the application run or in the results. The key thing to remember is that IT is NOT responsible for the quality of the data. My friend Laura Cullen ran the Enterprise Data Warehouse for a big bank where I worked. One day one of my business peers asked her why the Warehouse didn't deliver quality data. Laura replied that she would love to deliver quality, but she needed three things to do it. The first was that the business had to tell her what was meant by "quality" -- in other words, a set of data quality rules. The second was a set of instructions on what she was to do with the data that didn't meet those rules. For example, should she stop the load? Skip the bad records? Write out error messages? You get the idea. Finally, she needed funding to build the code for the rule engine that would enforce the data quality rules and detect where they were being violated. A very wise woman, my friend Laura.
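Laura's three ingredients map neatly onto code. What follows is only a minimal sketch of that kind of rule engine, with made-up rules and field names: the business supplies the rules and the disposition for failures; the engine merely enforces them:

```python
# Hypothetical data quality rules, supplied by the business.
RULES = [
    ("customer_id is required", lambda r: bool(r.get("customer_id"))),
    ("amount must be non-negative", lambda r: r.get("amount", 0) >= 0),
]

def load(records, on_error="skip"):
    """Load records, enforcing RULES. The business also decides the
    disposition for bad data: 'skip' the record or 'halt' the load."""
    loaded, errors = [], []
    for rec in records:
        failed = [name for name, check in RULES if not check(rec)]
        if failed:
            errors.append((rec, failed))  # write out the error messages
            if on_error == "halt":
                break                     # stop the load entirely
            continue                      # don't write the bad record
        loaded.append(rec)
    return loaded, errors

good, bad = load([
    {"customer_id": "C1", "amount": 10},
    {"customer_id": "",   "amount": 5},
    {"customer_id": "C2", "amount": -3},
])
print(len(good), len(bad))  # 1 clean record, 2 rule violations
```

Notice that nothing in the engine decides what "quality" means or what to do about failures; both come from the business, which was exactly Laura's point.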
And so, data quality is seldom lost. Of course, getting the quality to where you need it to go might seem a mite perplexing. But that's another story.