In the last few months I have realized that I strayed from the base topic of data and data-related functions. I did this in part to broaden horizons into the subjects of project management and management, two subjects that are also dear to my heart. (I appreciate DATAVERSITY™ supporting me in these side trips on the road of life.) I may return to these subjects from time to time, but three separate events this week have me returning to the bull's-eye of the DATAVERSITY target. Now it is time to come back to another topic near and dear to my heart – specifically, data quality.
Those of you who have been fortunate enough (or unfortunate, depending on which side of the fence you sit upon) to hear one of my presentations that touches on the issue of data quality know that I always include a slide showing that data quality is not a new issue. I reference the so-called ‘Wicked Bible’, printed in 1631, which omitted the word ‘NOT’ from the 7th Commandment, thus reading “Thou Shalt Commit Adultery”. (Obviously this caused quite a stir in 1631, as you can imagine.) The Bibles were recalled, the printer was fined £300 for the mistake, and fewer than a dozen copies still remain. I use this reference to show that any data is susceptible to error, regardless of how sure we are of its correctness or how much effort has gone into its creation.
While data errors are not new, those of us in the business would like to hope that they are diminishing in both magnitude and frequency. Sadly, my hopes were dashed in a single day this week when three seemingly flagrant (to me) data issues occurred within a mere four-hour period.
Incident #1 – We all receive those emails from universities and other centers of higher learning asking us to partake in their survey on this topic or that. I believe there is good in such things, and when my schedule allows, I do my best to accommodate these requests. This past Friday morning, having completed a major project the night before and with a light meeting schedule, I retrieved the email request that had appeared in my inbox early in the week and decided to do my part to help them out by increasing their ‘n’ by one. I clicked the enclosed hyperlink and immediately jumped to the disclaimer page of the survey. So far … so good. Signifying my understanding of and agreement to participate, I proceeded to page 2, where the trouble started. The first question asked which state I represented. A simple enough question, yet the response area was an input box rather than a drop-down. Immediately I smirked, thinking of all the potential manual data review and manipulation that lay ahead for the survey maker. While wishful thinking would allow for the possibility that everyone could spell their own state name, would they all capitalize it properly, or would some use abbreviations – and would those have both letters upper case, or one of each, as seems to be the custom more often these days? Still, it was a relatively minor point. On to Question 2 …
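To give a sense of the cleanup that free-text state entries force on the survey maker, here is a minimal sketch – the mapping, function name, and sample responses are all hypothetical, covering only a couple of states for illustration:

```python
# Hypothetical sketch: normalizing free-text state responses after the fact.
# The mapping covers only a few states for illustration; a real survey would
# avoid the problem entirely with a drop-down list.
STATE_CODES = {
    "virginia": "VA", "va": "VA",
    "new york": "NY", "ny": "NY",
}

def normalize_state(raw):
    """Return a two-letter state code, or None if the entry needs manual review."""
    return STATE_CODES.get(raw.strip().lower())

# The same state arrives in many shapes; some still defeat the mapping.
responses = ["Virginia", "VA", "New york", "NY ", "Va."]
cleaned = [normalize_state(r) for r in responses]
# "Va." (with the period) falls through to None and must be reviewed by hand.
```

Every variant the mapping misses is a row someone has to inspect manually – exactly the work a drop-down would have eliminated.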
This question asked for my job role and listed four options plus ‘Other’. Since none of them correctly fit my role, I typed my title into the ‘Other’ input box and selected ‘Next’ to move forward to page 3. Imagine my surprise when I received an error in nice neat red text above Question 2 stating that ‘This question requires an answer.’ I looked from the questionably compliant lettering (although many do this, red is one of the colors most affected by color blindness, is it not?) to the completed input box and back again. In less than a second it became obvious to me: there was no radio button in front of the ‘Other’ option, so my text was not being recognized as a response. I switched back to the original email request and sent a nice message to the listed point of contact, explaining my issue and including a section of a screen print showing the error and the input box. Amazingly, within 5 minutes I received a response stating that he had been unaware of the problem, that they had made the necessary changes to allow the input box information to be acknowledged, and thanking me for bringing it to his attention.
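The fix they applied amounts to a validation rule that accepts either a selected role or non-empty ‘Other’ text. I never saw their code, so this is purely a hypothetical sketch of that rule:

```python
# Hypothetical validation for the job-role question: an answer counts if a
# listed role is selected OR the "Other" box contains real text. The original
# form apparently checked only the radio buttons, so typed-in titles were ignored.
def role_answered(selected_role, other_text):
    """Return True if the respondent provided any usable role answer."""
    return selected_role is not None or other_text.strip() != ""

# Before the fix, a typed title with no radio selection triggered the error.
role_answered(None, "Data Quality Manager")  # accepted under the corrected rule
role_answered(None, "   ")                   # still rejected - no real answer
```

One missing `or` clause like this is all it takes to push respondents into the wrong categories.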
Ladies and Gentlemen, I know that I received this survey request from a listing of several hundred individuals that includes state CIOs and senior IT executives across the nation. The fact that in the 5 days since the email had been sent I was the first person to contact them told me one of four things was probably happening:
- No one had yet responded to the survey (unlikely)
- Everyone that had taken the survey conveniently fit into one of the four classifications listed (unlikely)
- Responders got the error message, said ‘enough of this’ and exited out (quite possibly)
- Those that got the error message switched their answer to one of the four roles listed and continued on (quite likely)
Based on these possibilities, how credible, to me, is the survey overall, without even knowing what the remaining questions were? Was I going to anxiously wait for the compiled results, traditionally sent to those who participate in the survey? I think not.
Incident #2 – An organization I am closely associated with had requested input on the types of functions staff performed on a regular basis, special requests, etc., to help better understand overall work capacity. The quarterly submissions were in and compiled, and the results were sent around to the management team. These numbers feed a series of metrics that I prepare for the organization, so I settled in to prepare the updates. My files are set up to start at the top, so executive management was the first category. Since I had been doing these updates for several years, I immediately noticed a discrepancy: an entire staff grouping had been omitted from the report. When I contacted the individual who had provided the spreadsheet, I was told, “We don’t create the numbers, we just run them.”
While I understand that this particular file may not be one of the most popular distributed for review, a few comments within the email might have helped to explain changes or exclusions from the summary. Without such comments, the summary is assumed to be accurate, and unless you look closely at the supporting tabs, such inaccuracies go unnoticed.
Incident #3 – Another survey, this time forwarded to me by my boss to complete. This falls under the category of ‘other duties as assigned’, and I have no problem with performing such requests when time allows. It turned out to be a review of the advertisements and articles of a rather popular trade magazine that I do read regularly. After confirming all of my information was correct, I delved into the first question. It showed a full-page ad from the latest issue and had several questions beside it that I will paraphrase. Question #1 – “Do you recall seeing this ad in last month’s magazine?” (Yes or No) Question #2 – “When you saw it, did you read the ad far enough to determine what message was being delivered?” (Yes or No)
Naturally, I had selected ‘No’ as the response to #1, and I now sat in a mental quandary as to how I should respond to #2. Of course, I couldn’t leave it blank (yes, I tried). If I replied ‘Yes’, how did that correlate to my ‘No’ for Question #1? And if I replied negatively to #2, would they understand that I was merely saying ‘No’ because I hadn’t actually recalled seeing the ad? Or might the advertising firm lose the client if enough people like me responded with a ‘No’ that really should have been a third choice, ‘n/a’ – or, better yet, a question never shown to me at all, based on my previous answer?
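The branching the survey needed amounts to a few lines of skip logic. This is a hypothetical illustration, not the magazine's actual system – the question labels are mine:

```python
# Hypothetical skip logic: only ask the follow-up when Question 1 was "Yes".
# Routing around Q2 prevents the contradictory Yes/No combination (and the
# respondent's quandary) from ever entering the data.
def next_question(recalled_seeing_ad):
    """Given the answer to Q1, return the label of the next question to show."""
    if recalled_seeing_ad == "Yes":
        return "Q2"  # they saw the ad; ask whether they read far enough
    return "Q3"      # skip the follow-up; nothing misleading recorded for Q2
```

A ‘No’ on Q1 routes straight past Q2, so no one is forced to fabricate an answer the analysts will later misread.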
There you have it. Three incidents, in a single day, that were going to cause data quality issues in the future or where the issues had already occurred.
In relation to the two surveys, I have to ask – where was the review team or the testing team before each survey was released to the masses? Didn’t they run a single test script against these, or even have one person legitimately try to ‘break it’? Having been in the business for a couple of decades now, this seems to me like development 101 and (getting back to one of the other subjects I love to discuss) project management 101. As for the results of these surveys, and the spreadsheet I received, I guess the old adage “GIGO” (garbage in, garbage out) still holds true.
A former CISO friend of mine used to say that security was everyone’s business and responsibility. I believe that holds true for data quality as well. Everyone has a part to play in the data quality equation. What’s yours?