A Report of Michael Smilg’s Enterprise Data World 2011 Conference Presentation
by Charles Roe
Establishing a common framework for discussing data quality within an enterprise is no simple task, especially when such an enterprise is as substantial as Allstate, the largest publicly held personal lines insurer in the USA. Allstate is a Fortune 100 company with more than $130 billion in assets and 13 major lines of insurance, along with many retirement and investment products in its extensive portfolio. It employs more than 70,000 professionals around the world. Allstate has reinvented the idea of protection and retirement assistance, and as of 2011 safeguards approximately 17 million households. The technology environment at Allstate includes more than 4,000 IT professionals, 5,000 software applications and 100,000 supported computers across an array of operating systems, technology platforms and database systems. Allstate’s applications and services run the technology gamut, with everything from advanced analytics to capacity planning, enterprise content management to service-oriented architecture, ETL tools to financial applications and a host of others. Dealing with data quality issues in such a vast enterprise is certainly no easy undertaking.
During his Enterprise Data World 2011 Conference presentation, Michael Smilg – an Information Analyst at Allstate – discussed data quality in general and the software development life cycle (SDLC) from a data quality viewpoint, with some conceptual emphasis placed on an in-house-developed procedure called the Data Quality Planning Tool (DQPT). Due to Allstate’s proprietary rights and a desire to keep a competitive edge in the insurance industry, Mr. Smilg did not actually show the DQPT itself, nor give any direct examples from the tool. Instead, he discussed data quality in terms of concepts drawn from his Allstate data quality experience and how to apply them outside of the actual tool, so that anyone in the data management field can utilize such concepts for their organizations.
Enterprise Data Warehouse (EDW) Architecture
The Allstate EDW is set up much the same as the large EDWs used by many other enterprises. Standard EDW architectures employ fairly regular structures, with different taxonomies, and Allstate’s is no different; it includes:
- Raw Layer – This layer takes feeds from various operational source systems as-is and loads them straight in with no transformations; it is simply a move from one database file structure to another.
- Standard Layer – Integration takes place here with some transformation, and it is set up like a traditional data warehouse. There is no suspense file and no error detection; everything goes into this layer regardless. There is some standardization, but no enrichment or cleansing. The enterprise still employs an old MDM system utility that is used by business participants on the front end as well.
- Presentation Layer – The tables and views defined in the database live here. It is the structure where database reorganization takes place, where complex business rules are applied and where the data is made fit-for-use for analytics.
- Universes (Business Objects) – These are the semantic constructs that the analytic tools sit on top of and use as their raw input.
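To make the flow through these layers concrete, here is a minimal, hypothetical Python sketch; the field names, values and rules are assumptions for illustration, not Allstate’s implementation. The raw layer lands the feed as-is, the standard layer applies light standardization with no cleansing and no suspense file, and the presentation layer applies business rules so the data is fit for analytic use.

```python
# Hypothetical sketch of data moving through layered EDW structures;
# field names, values and rules are illustrative only.

# Raw layer: the operational feed lands as-is, with no transformations.
raw_layer = [
    {"quote_id": "Q-1", "state": " il ", "zip": "60062-1234", "status": "ACTIVE"},
    {"quote_id": "Q-2", "state": "WI",   "zip": "53201",      "status": "lapsed"},
]

def to_standard(raw_rows):
    """Standard layer: light standardization, but no cleansing, no enrichment,
    and no suspense file -- every record goes in regardless of quality."""
    out = []
    for row in raw_rows:
        row = dict(row)
        row["state"] = row["state"].strip().upper()   # standardize casing/whitespace
        row["zip"] = row["zip"].strip()[:5]           # standardize format
        out.append(row)                               # never reject a row
    return out

def to_presentation(standard_rows):
    """Presentation layer: apply business rules so the data is fit for analytic use."""
    return [
        {**row, "active_policy": row["status"].upper() == "ACTIVE"}  # example rule
        for row in standard_rows
    ]

standard_layer = to_standard(raw_layer)
presentation_layer = to_presentation(standard_layer)
print(presentation_layer)
```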
In terms of data integration, specific “transactional” feeds into the EDW are not treated as sources of record. For example, a quote file coming in as a feed from an agent carries information such as name, address and phone number. The EDW does not use that feed as a source or system of record for agent information. Instead, there are feeds directly from the agent data stores, and the quote is linked to the agent data store rather than taking the information from the transaction. This structure sets up some specific data quality issues, and added to that are changes over time regarding a shift in focus within the EDW from purely analytical to operational/managerial.
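A minimal sketch of that integration idea, with made-up identifiers and field names: the agent attributes carried on the quote transaction are ignored, and the quote is linked by an identifier to the agent data store, which remains the system of record.

```python
# Illustrative only: link the quote to the agent data store rather than
# trusting the agent details embedded in the transaction feed.
agent_store = {  # system of record for agent information
    "A-1001": {"name": "Pat Jones", "phone": "555-0100", "office": "Chicago"},
}

quote_feed_record = {
    "quote_id": "Q-42",
    "agent_id": "A-1001",
    "agent_name": "P. Jones",       # embedded agent info on the transaction...
    "agent_phone": "555-0199",      # ...which may be stale or mistyped
    "premium": 1250.00,
}

def integrate_quote(quote, agents):
    """Keep the transactional facts, but source agent attributes from the agent store."""
    agent = agents[quote["agent_id"]]
    return {
        "quote_id": quote["quote_id"],
        "premium": quote["premium"],
        "agent_id": quote["agent_id"],
        "agent_name": agent["name"],     # from the system of record,
        "agent_phone": agent["phone"],   # not from the feed
    }

print(integrate_quote(quote_feed_record, agent_store))
```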
The EDW was originally needed only for general analytics, so it only had to be directionally correct in terms of reporting; there was no need for 100% accuracy with all the data. But as time progressed, the Allstate EDW took on more functions, and it is now used for transactions like the paying of agent commissions, customer refunds and many others that need truly accurate data. Thus, as the EDW has grown and transformed, data quality has become a much higher-profile problem, and the enterprise has needed much more clearly defined data quality concepts and a more stringent data quality system.
Data Quality Concepts
Data quality is a program, not a project; that idea is perhaps one of the most difficult to sell to any enterprise. But without an enterprise-wide understanding of such a concept, data quality will take a back seat to other programs and long-term data quality issues can become prevalent. According to Larry English, data quality means being fit for all uses and “consistently meeting knowledge worker and end customer expectations.” Data quality also differs by data element and data user, and it can change many times throughout different stages in the life of an enterprise. Therefore, within a data quality system not all fields are of equal importance, and their importance will also change over time depending on the needs of the enterprise and the users employing that data. Regarding data changing over time, the words “change” and “time” are of particular importance:
- Change: This idea refers to the fact that the actual content changes; something happens in a given front-end system. It’s possible that experienced data gatherers in an enterprise are replaced by inexperienced ones, that a data quality training program is stopped due to budget overruns so new data becomes less clean than older data, or that a manager decides to cut costs and speed up processes, so data quality falls by the wayside.
- Time: The expectations of how data is used change over time. The yardstick of how the data is used, why it is used and who uses it changes over time as well. Data quality expectations for marketing are different than data quality expectations for agent bonuses or customer refunds.
There are a number of other important elements in a data quality concept, including:
- Data quality is not quality control: While data quality is fit-for-all-uses, it is also a means to an end, not an end in itself. Data quality never ends; only certain projects end. Quality control, by contrast, is a monitoring system to see whether various programs (such as an ETL job) are working per spec. If the spec says “if A then do B,” it is the job of quality control to check that the spec is being followed (see the sketch after this list).
- Business Impact: Data quality must always be viewed in terms of the business impact, not through the lens of arcane measurements and confusing figures. The bottom line is at stake here, so a proper data quality system should work for the business and not the other way around. The business must fund the data quality effort, so value is of central importance.
- Data Surprises: Perceived data quality issues, or “data surprises,” are often due to misunderstandings of usage rules. Such problems may be fixed through proper training, but others might actually stem from incomplete or inaccurate metadata, and the professionals in charge of such issues need to fix them or more errors will continue to propagate through the system.
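To illustrate the distinction drawn in the “not quality control” item above, here is a small hypothetical Python sketch of a quality control check: given a spec of the form “if A then do B,” it only verifies that an ETL transformation followed the spec; it says nothing about whether the data is fit for any particular use. The rule and field names are assumptions for illustration.

```python
# Hypothetical quality-control check: verify an ETL rule was applied per spec,
# e.g. "if the source status code is 'C', the target record must be flagged cancelled."
def etl_transform(source_row):
    # The ETL job under test (spec: if status == "C" then cancelled = True).
    return {"policy_id": source_row["policy_id"],
            "cancelled": source_row["status"] == "C"}

def quality_control_check(source_rows, target_rows):
    """Return the policy IDs of any target rows that do not follow the spec."""
    violations = []
    for src, tgt in zip(source_rows, target_rows):
        expected = (src["status"] == "C")
        if tgt["cancelled"] != expected:
            violations.append(src["policy_id"])
    return violations

source = [{"policy_id": 1, "status": "C"}, {"policy_id": 2, "status": "A"}]
target = [etl_transform(r) for r in source]
print(quality_control_check(source, target))  # [] means the job is working per spec
```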
Data quality needs to be sold within an enterprise as a program that never ceases. Often data quality becomes a central point because of quality control issues that show up due to previous (and possibly long-term) data problems. A program is started, data quality becomes a buzzword, people are trained, the data is essentially “fixed” and the program ends. Sometime later, the problem shows up again, and again, and again. Thus, data quality is not an end but a means, and a comprehensive data quality program needs to be ever present within an enterprise.
Four Parts to Resolving a Data Quality Problem
A data quality problem could be something new, or could be a problem that continually arises, especially if data quality is only a project that happens when problems occur. There are four major parts or ways to resolve a data quality issue and they are best looked at in terms of an analogy. Plumbers are experts at fixing leaky pipes, just as data quality professionals are experts at fixing data quality. In this analogy of the leaky pipe, there is only a pool of water and no other detection device present.
- Fix the Leak (future-forward option): This future-forward decision for the data quality plumber doesn’t look at what has already happened, but rather at what will happen later. They focus on the immediate cause, whether it’s an ETL problem or a bad-source-data problem, fix it and make sure it doesn’t happen again.
- Clean up the Puddle (past history): If data quality is a problem, say within a Presentation Layer database of agent source data, then remediating the data that has already been loaded cleans up the mess. As a past-history approach, the data quality professional/plumber is going backwards, looking at the puddle that already exists and cleaning it up, rather than stopping the problem at its source.
- Install Water Detection (future forward): As another example of a future-forward option, this fix requires setting up a system of checks and balances to make sure any future water issues do not go unnoticed. Such a data quality system could mean establishing responsibility, creating a Service Level Agreement, running more comprehensive data quality system reports or other “sump pump”-like fixes so that the water is detected as soon as it starts to leak (see the sketch after this list).
- Analyze Pattern of Leaks and Take Preventative Measures (future forward and past history): A dual approach would be to look deeper into the data quality problem, figure out where and why such problems are happening, then put a system in place to stop them. This is a much more comprehensive approach to data quality, but also a more expensive one. It may be necessary to set up incentives within an enterprise for practicing good data quality, especially for those entering the data into front-end systems. If insufficient editing, poor design, lack of communication, incomplete policies or any number of other problems are causing the proverbial pipes to freeze, then these issues need to be understood and proper future-forward preventative measures taken to stop them from happening later.
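As a rough picture of the “Install Water Detection” option above, the hypothetical Python sketch below profiles a critical field after each load and raises alerts as soon as agreed thresholds are breached; the thresholds, field names and alerting approach are assumptions for illustration, not a description of Allstate’s DQPT.

```python
# Hypothetical "water detection": a scheduled data quality check that alerts
# as soon as a critical field drifts outside an agreed (SLA-style) threshold.
MAX_NULL_RATE = 0.02        # assumed SLA: at most 2% of rows may lack an agent_id
MIN_EXPECTED_ROWS = 1000    # assumed SLA: a daily feed smaller than this is suspect

def detect_leaks(rows):
    alerts = []
    if len(rows) < MIN_EXPECTED_ROWS:
        alerts.append(f"Row count {len(rows)} below expected minimum {MIN_EXPECTED_ROWS}")
    missing = sum(1 for r in rows if not r.get("agent_id"))
    null_rate = missing / len(rows) if rows else 1.0
    if null_rate > MAX_NULL_RATE:
        alerts.append(f"agent_id null rate {null_rate:.1%} exceeds {MAX_NULL_RATE:.0%}")
    return alerts

# In practice this would run after each load and notify the responsible owner
# named in the Service Level Agreement.
sample = [{"agent_id": "A-1001"}] * 900 + [{"agent_id": ""}] * 50
for alert in detect_leaks(sample):
    print(alert)
```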
The worst possible approach is to simply ignore the problem and hope it goes away. There are times when plumbing problems do not show up for a long time, and then only as ceiling discolorations or cracked paint. Such data quality issues may exist in the loading of source data, in the raw data structures of the lowest layers and in many other places. They are often the most difficult to fix, and ignoring them will only cause serious structural problems and extensive costs later on. Just like leaky pipes, poor data quality is a structural problem within an enterprise and needs to be dealt with constantly, before the frozen pipes burst and turn the entire building into a lake.