Data Quality Ain’t Lost

June 27, 2012 / 3 Comments

by David Plotkin

The famous American frontiersman Daniel Boone was once asked if he was lost. “No,” he replied, “lost means you don’t know where you are. I know where I am. It’s how to get to where I am going that has me a mite perplexed.” I thought about this definition of “lost” recently when a professional acquaintance complained that his company’s data used to be of good quality, but somehow that high-quality data was lost.

Of course, in the sense that Daniel Boone used the word, it is rare that the quality of data is ever really “lost”. If you think about the business processes and how the data is used, you can usually figure out what happened to the quality. There aren’t really that many options for why good data goes bad:

1. It might be that nothing changed at all. This is true a surprising amount of the time. The data quality was fine for the purposes it was used for before, but it was insufficient for a new purpose. In the introduction to Danette McGilvray’s seminal book on Data Quality, there is a story about a pharmacy data system where all sorts of symbols were appended to the patient’s last name to indicate information such as the fact that the patient had other insurance, or was covered by workers’ compensation, or a whole host of other information that the aging pharmacy system had no fields for. This was not a problem because the data in the last name field wasn’t used for anything else. But then…the business process changed and the data was put to use for a new purpose. Specifically, the pharmacy began sending out refill reminders, using the patient’s name to generate the mailing labels. After the first batch went out, there was a flood of angry calls from people wondering why their name showed up with all sorts of symbols after it. And why do I know this story so well? Well, guess who had to write the program to strip all those characters off the names before the labels were generated?

2. The data might have started getting put in wrong. There are a whole host of reasons why this can happen. If the business suddenly starts putting a premium on speed of data input, speed is exactly what you are going to get. The people incented to be fast will figure out every shortcut, use defaults where possible, skip every field they can, and so on. As the old adage goes — be careful what you measure. Another problem can be training — if you suddenly bring on a new crew of people to do the work (for example, opening a new contact center with new employees) or start using a new application, people may not know HOW to enter quality data. This is especially true in companies where the QA on new applications is done hastily or with no attention paid to the quality of the screen layout and labeling, enforcement of business rules (did you even collect them?), and metadata rules, such as lists of valid values. Most people don’t climb out of bed in the morning saying “today I’m going to put in crappy data”, but you have to make it easy to do the right thing. This involves good application design and adequate training. And incenting for quality!

3. The problem may not actually be a “data quality” problem, but instead a “metadata quality” problem. The classic example is a term that “everyone knows the meaning of”. A friend told me a story about a group of BI developers spending more than a week trying to figure out why numbers reported by two different groups simply didn’t balance — in fact, weren’t even close. My friend even got both groups to define the term (which happened to be “transaction”) and both used the same definition. The problem? One group counted transactions that completed and for which a payment was successfully received. The other group counted every attempt to complete, including multiple declines by the credit card company. This failure of the “derivation rule” caused the wide variation. Ouch!

4. And yes, our friends in IT can occasionally bollox up the works by using the wrong source or scrambling an ETL (extract, transform, and load) job. But if changes are carefully designed, adequately tested (including by the business), and documented, this doesn’t happen too often. Again, this sort of thing should be caught during the testing phase if you are actually looking at the data. A bad mapping will likely show up as nonsensical results; an incorrect implementation of business rules should show up as an error, either during the application run or in the results. The key thing to remember is that IT is NOT responsible for the quality of the data. My friend Laura Cullen ran the Enterprise Data Warehouse for a big bank where I worked. One day one of my business peers asked her why the Warehouse didn’t deliver quality data. Laura replied that she would love to deliver quality, but she needed three things to do it. The first was that the business had to tell her what was meant by “quality”: in other words, a set of data quality rules. The second was a set of instructions on what she was to do with the data that didn’t meet those rules. For example, should she stop the load? Skip the bad records? Write out error messages? You get the idea. Finally, she needed funding to build the code for the rule engine that would enforce the data quality rules and detect where they were being violated. A very wise woman, my friend Laura.
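To make Laura's first two requirements concrete, here is a minimal sketch in Python of what such a rule engine might look like. It is an illustration only, not anything Laura or the bank actually built. Every name in it (Rule, Action, LoadStopped, load, and the field names patient_id, last_name, and refills) is hypothetical, and the last-name rule is a deliberately simplified take on the pharmacy story from point 1: it accepts only unaccented letters, spaces, hyphens, and apostrophes.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    """What the load does when a record breaks a rule (the business decides)."""
    STOP_LOAD = "stop_load"      # abort the whole load
    SKIP_RECORD = "skip_record"  # log the failure and do not write the record
    LOG_ONLY = "log_only"        # log the failure but still write the record


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes
    on_failure: Action


class LoadStopped(Exception):
    """Raised when a rule marked STOP_LOAD is violated."""


# Simplifying assumption: a valid last name starts with an unaccented letter
# and contains only letters, spaces, hyphens, and straight apostrophes.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z' -]*")

# Hypothetical rules; in practice these come from the business, not from IT.
RULES = [
    Rule("patient_id_present",
         lambda r: bool(r.get("patient_id")),
         Action.STOP_LOAD),
    Rule("last_name_has_no_symbols",
         lambda r: NAME_PATTERN.fullmatch(str(r.get("last_name") or "")) is not None,
         Action.SKIP_RECORD),
    Rule("refills_not_negative",
         lambda r: r.get("refills", 0) >= 0,
         Action.LOG_ONLY),
]


def load(records, rules=RULES):
    """Apply every rule to every record; return (loaded_records, error_messages)."""
    loaded, errors = [], []
    for i, record in enumerate(records):
        keep = True
        for rule in rules:
            if rule.check(record):
                continue
            message = f"record {i}: failed {rule.name}"
            errors.append(message)
            if rule.on_failure is Action.STOP_LOAD:
                raise LoadStopped(message)
            if rule.on_failure is Action.SKIP_RECORD:
                keep = False
        if keep:
            loaded.append(record)
    return loaded, errors


records = [
    {"patient_id": "P1", "last_name": "Boone"},
    {"patient_id": "P2", "last_name": "Smith*#"},
    {"patient_id": "P3", "last_name": "O'Neil-Jones", "refills": -1},
]
loaded, errors = load(records)
```

In this sketch the record with symbols after the name is logged and skipped, the record with a negative refill count is logged but still loaded, and a record with no patient_id would stop the load altogether. The point of keeping the rules and the actions as plain data is that the business can own and change them without anyone rewriting the load itself.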

And so, data quality is seldom lost. Of course, getting the quality to where you need it to go might seem a mite perplexing. But that’s another story.

About the author

David Plotkin is an Advisory Consultant for EMC, helping clients implement or mature Data Governance programs in their organizations. He has previously served as Manager of Data Governance for the AAA of Northern California, Nevada, and Utah; Manager of Data Quality for a large bank; and Data Administration Manager at a drug store chain. He has been working with data modeling, data governance, metadata, and data quality for over 20 years. He serves as a subject matter expert on many topics around metadata, data governance, and data quality, and speaks often at industry conferences.

  • John Biderman

    Of all these excellent points, I’ve run into #2 repeatedly in situations where data was converted from a legacy system into a new one. The legacy systems were lenient or lax in enforcing data quality standards (for example, an old database system that enforced no referential integrity and few data-entry constraints; or another system that did no duplicate checking), but when the data became visible in the modern systems all kinds of issues started to surface. This leads to what I think is a general maxim that you should NEVER, ever convert data from an old platform to a new one without doing data quality checks and, if possible, cleansing. Why fill up a nice new system with all your legacy junk?

    • David Plotkin

      Hi John,
      Excellent point. In fact, it is often true that when you try to convert data from a legacy system with few rules into one where the rules have been carefully designed, it simply won’t go, at least not if the conversion programs enforce the new rules. And I couldn’t agree with you more that you NEVER, EVER convert data from an old platform to a new one without doing data quality checks (and cleansing). Seems like an awfully obvious point, but you know how that goes!

  • Excellent article. Your friend Laura’s approach on point #4 should be drummed into anyone handling data from an early age.

    The derivation rule problem is often a sign (in my experience) of poor or inadequately written requirements. Granted, it is the responsibility of the developer to check the requirements, but all too often we fill in the blanks based on past experience.

    Lots of great lessons here. Thanks!

