Your Data Quality Situation is Unique (But it Really isn’t)


Click to learn more about author Kevin W. McCarthy.

I’ve been involved in hundreds of Data Quality implementations. There is one thing every project has in common: The project team thinks their data is different from anything I might have seen before. To some extent that is true; each company has its own volumes of data records, and this data may be stored in a number of unique environments. In addition, each company has data elements specific to its vertical (claims data for insurance companies, inventory and product information for retailers, and so on), and others specific to the company itself. All this points to a unique data landscape for every individual company, right?

Well, not necessarily. At the end of the day, data is data. I’m not talking about .mpegs or .gifs, but good old-fashioned customer information. ASCII hasn’t changed in 50-plus years (and for you old-timers, EBCDIC hasn’t either!). Names and addresses are complex, but generally adhere to a consistent structure. Platforms and storage have changed (from mainframes, to UNIX, to PCs, and from flat files, to relational DBs, to distributed clusters), but the data itself has stayed largely the same. This is a good thing, and it is why we have a rich and robust Data Management industry helping companies harness the power of their data. The data isn’t all that unique; the differences come into play in how companies choose to organize, maintain, and utilize it.

An important part of any data project is analyzing the current state of the data at hand, which is commonly called data profiling. I would argue that this is the first necessary step in any Data Management initiative. The process of data discovery is also one of the best examples of how data is generally the same. You are guaranteed to find date columns with a bunch of 01/01/1900s. You’ll find blanks in fields that must have a value populated. And you will consistently find junk roaming around in your name and address information (account numbers and SSNs in name fields, descriptions or comments in street lines): stuff that doesn’t belong there, but the data entry person couldn’t find a better place for it! This is particularly true for “error-reducing” user interfaces, because the more restrictions that are placed on data entry, the more creative people get when they try to enter information that doesn’t quite fit.
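To make those usual suspects concrete, here is a minimal profiling sketch in Python. It is an illustration only, not how any particular profiling tool works: the field names (`name`, `street`, `opened`), the placeholder date values, and the SSN-style pattern are all assumptions made up for the example, and real data would need a broader set of checks.

```python
import re
from collections import Counter

# Hypothetical placeholder values and pattern; tune these for real data.
PLACEHOLDER_DATES = {"01/01/1900", "1900-01-01"}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def profile(records, required_fields, date_fields, name_fields):
    """Count a few common data quality issues across a list of dict records."""
    issues = Counter()
    for record in records:
        # Required fields that are missing, empty, or whitespace only.
        for field in required_fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                issues[f"blank:{field}"] += 1
        # Date fields holding a known default/placeholder value.
        for field in date_fields:
            if str(record.get(field, "")).strip() in PLACEHOLDER_DATES:
                issues[f"placeholder_date:{field}"] += 1
        # Name fields containing something shaped like an SSN.
        for field in name_fields:
            if SSN_PATTERN.search(str(record.get(field, ""))):
                issues[f"ssn_like:{field}"] += 1
    return issues


# Made-up sample records for illustration.
sample = [
    {"name": "Ann Lee", "street": "12 Oak St", "opened": "03/14/2015"},
    {"name": "Bob Ray 123-45-6789", "street": "", "opened": "01/01/1900"},
    {"name": "", "street": "9 Elm Rd", "opened": "01/01/1900"},
]

report = profile(sample, ["name", "street"], ["opened"], ["name"])
for issue, count in sorted(report.items()):
    print(issue, count)
```

Even a crude counter like this tends to surface the same patterns from one data set to the next, which is the point: the checks are reusable because the problems are.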

In practice, I don’t think I ever once told anyone during one of my consulting engagements, “Your data is the same as everyone else’s!” They didn’t want to hear it, and I wanted to be able to perform magic in anticipating, understanding, and resolving their “totally unique” Data Quality situations. But, if you can swallow the fact that you’ve got the same issues as everyone else, the silver lining is that there are likely tried-and-tested ways to address those issues. 

In fact, the use of industry consultants or analyst recommendations paired with prepackaged Data Management software applications takes advantage of the similarities of data issues. The software can even have preconfigured and prepopulated rules and transformations to deal with the exact scenarios that most people encounter with their data. This is the value that these packages provide, as opposed to hand-writing these transformations in code and SQL from scratch. The scales tip quickly toward an ROI on the software once your data reaches a volume and complexity where the issues become more numerous and more varied, and manual intervention becomes too much.
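For a sense of what “rules and transformations” means here, the sketch below shows the general idea in Python: a small, declarative list of rules applied to every record. It is not taken from any vendor’s product; the rule names, the field names, and the tiny street-suffix table are hypothetical, and a packaged tool would ship with far larger and better-tested rule sets.

```python
# Hypothetical lookup table; real address standardization uses much larger ones.
STREET_ABBREVIATIONS = {"street": "St", "avenue": "Ave", "road": "Rd"}


def collapse_whitespace(value):
    """Trim the ends and squeeze internal runs of whitespace to one space."""
    return " ".join(value.split())


def abbreviate_street_suffix(value):
    """Replace a trailing word such as 'Street' with its abbreviation."""
    words = value.split()
    if words:
        last = words[-1].rstrip(".").lower()
        words[-1] = STREET_ABBREVIATIONS.get(last, words[-1])
    return " ".join(words)


def null_placeholder_date(value):
    """Treat the assumed default date 01/01/1900 as a missing value."""
    return None if value == "01/01/1900" else value


# Each rule is a (field, transformation) pair, applied in order.
RULES = [
    ("name", collapse_whitespace),
    ("street", collapse_whitespace),
    ("street", abbreviate_street_suffix),
    ("opened", null_placeholder_date),
]


def apply_rules(record, rules=RULES):
    """Return a cleaned copy of the record; the input is left unchanged."""
    cleaned = dict(record)
    for field, transform in rules:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = transform(value)
    return cleaned


cleaned = apply_rules(
    {"name": "  Ann   Lee ", "street": "12 Oak  Street", "opened": "01/01/1900"}
)
print(cleaned)
```

Writing three rules by hand is easy. Maintaining hundreds of them as the data grows is where a prepackaged rule library starts to pay for itself.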

My advice is to be happy that you’re not special, at least when it comes to your data! Swimming in the same data pool (or lake!) as everyone else has its benefits in leveraging the software solutions that have already been created.
