A Report of Michael Smilg’s Enterprise Data World 2011 Conference Presentation
by Charles Roe
You are now reading Part 2 of the report based on Michael Smilg’s Enterprise Data World 2011 Conference presentation titled “Planning for Data Quality in the SDLC.” The first part of this report included a short introduction to Allstate, some of the complications such a large enterprise faces when dealing with Data Quality, and a general introduction to the Enterprise Data Warehouse (EDW) that Allstate uses, providing a framework for a deeper discussion of Data Quality.
The primary focus of Part 1 was to outline some of the main concepts within Data Quality, including change over time, the distinction between Data Quality and quality control, the business impact of Data Quality, and data surprises. These were discussed under the overarching idea that Data Quality is a program, not a project, and that this idea must be sold to the primary stakeholders within an enterprise if Data Quality is ever to become a prominent element in the SDLC. An analogy of leaky pipes was then used to highlight the four essential ways to resolve a Data Quality problem:
- Fix the leak
- Clean up the puddle
- Install water detection
- Analyze the pattern of leaks and take preventative measures
Part 2 will continue the discussion of Data Quality within an enterprise and highlight some of the most important Data Quality assessment techniques that Data Quality professionals can use to find data errors. It will then discuss Allstate’s Data Quality Planning Tool (DQPT) in general terms – no specific examples are used – and provide an overview of what such a tool can do in terms of implementing Data Quality into the SDLC of a given project. The conclusion will discuss the SDLC from a Data Quality perspective.
Data Quality Assessment Techniques
Data Quality needs to be a deliberate, visible, documented program within an enterprise, built upon a structured and systematic methodology that all the various stakeholders involved understand and follow – no one wants to drown in bad data due to continually broken pipes. There are countless data quality assessment techniques in the tool bags of data quality professionals:
- Data Profiling: This technique may not tell a data quality professional much about a specific data quality issue, but it is an excellent starting point for narrowing down problems. For example, is an activity event code that is 25% null bad or acceptable? Such questions need answers before the necessary fixes can be completed. It is then possible to use conditional and segmented filtering options to narrow down the possible problem areas (a brief illustrative sketch follows this list).
- Event Logs: There are countless different types of logs and error types within a given organization, and many are important for assessing data quality. At Allstate, all data is loaded into the raw layer with no error fixing upon load. The event logs can then be studied and errors checked, and as the data moves up through the layers, the data quality issues can more easily be addressed and fixed if necessary. Error logs can surface issues such as referential integrity violations, code verification failures, data verification failures, unexpected input, transformation/lookup errors and many more.
- Validation Audit Rules: These can be done for the technical and business ends of the process.
- Parallel Summary Reports: These give automated comparisons.
- Source to Target Compares: Look at the data as it moves from the source to the target system.
- Manual Spot Checks: These can be performed as random checks or as deliberate checks on specific areas, covering both known problems and verification that certain characteristics really do exist.
- End to End Use Cases: Get a total picture of a specific data case across the entire spectrum.
- ETL Run Logs: Check for errors during ETL phases.
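As a concrete illustration of how a couple of these techniques can be applied in practice, the short Python sketch below profiles null rates on a hypothetical source extract (including a segmented view of one column) and performs a simple source-to-target row-count comparison. The file names, column names (activity_event_cd, line_of_business) and the use of pandas are assumptions made for illustration only; they do not represent Allstate's actual environment or tooling.

```python
import pandas as pd

# Hypothetical extract files; the names are illustrative assumptions,
# not part of any actual Allstate environment.
source = pd.read_csv("source_extract.csv")
target = pd.read_csv("target_table.csv")

# Data profiling: null rate for every column in the source extract.
# A 25% null rate on a code column may or may not be acceptable;
# the profile only surfaces the question for the business to answer.
null_rates = source.isna().mean().sort_values(ascending=False)
print("Null rate by column:")
print(null_rates.head(10))

# Conditional/segmented profiling: the null rate of one (hypothetical)
# column broken out by a segmenting attribute.
segmented = (
    source.groupby("line_of_business")["activity_event_cd"]
    .apply(lambda s: s.isna().mean())
)
print("Null rate of activity_event_cd by line_of_business:")
print(segmented)

# Source-to-target compare: a minimal row-count reconciliation.
src_count, tgt_count = len(source), len(target)
if src_count != tgt_count:
    print(f"Row count mismatch: source={src_count}, target={tgt_count}")
else:
    print(f"Row counts match: {src_count}")
```

In practice, output like this is only a starting point: the business still has to decide whether a given null rate or row-count mismatch is acceptable for the intended use of the data.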
The unfortunate truth of any of these kinds of assessments, along with the implementation of a data quality program in general, is that they are not free. There is always an immediate short-term cost for data quality, while the benefits are long-term and diffuse; the best data quality system is one that no one ever sees – it just works. Executives love metrics, but data quality metrics are rather ephemeral: while it is possible to show the sort of assessments being done, the improvements are more invisible than evident. Data Quality is certainly a solution to already known data problems, but if the program is disbanded after 30 days and no more data quality professionals are employed within the enterprise, then all the hard work that has gone into the program will be lost and the data will eventually revert to leaky pipes.
Overview of the Data Quality Planning Tool (DQPT)
Enterprises need mechanisms that can systematically capture Data Quality requirements and implement assessment techniques capably throughout the entire organization. This report does not include any specific examples of the DQPT, but rather provides an overview of some of the methodologies involved in the creation and use of the tool. Allstate created the DQPT and it has worked well for them. Allstate needed the DQPT so that:
- Appropriate tasks could be included in the project plan, along with the schedule and cost estimates
- Data Quality requirements and assessment techniques could be clearly communicated within the enterprise to everyone
- Revisions and Data Quality details could be captured in a single document over the life of the program
- Post-project Data Quality assessment could be performed
The primary focus of the tool was project planning, so that Data Quality plans could be developed and Data Quality tasks would not be overlooked during the scheduling and budgeting phases. The tool is a short 2 ½ page questionnaire filled out jointly by the involved Project Managers, Business Analysts, Systems Analysts, and DQ SMEs; it includes embedded job aids and links to Best Practices, Standards and other relevant materials on the Data Quality site. It allows issues such as data profiling, checking error logs, code checks and many others to be better understood throughout the SDLC. The questionnaire provides a short but systematic way for the stakeholders to think through the Data Quality requirements for a given project, to consider the various validation techniques that could assess Data Quality, and then to include those Data Quality tasks in the project plan. Some general questions that could be used in such a questionnaire include, but are certainly not limited to:
- What data is involved in this project?
- What is the flow of that data from source to target?
- Who is the primary intended user and what is their intended use of the data?
- What aspects of data quality (e.g., consistency, integrity, validity, duplication, etc.) are relevant and fit-for-use?
- What is the required level of quality (tolerance of DQ problems) for each of the relevant aspects of Data Quality?
- What Data Quality assessment/validation techniques will be used?
- How do these questions relate to various phases of the project? Analysis, Test, Roll Out, On-going etc.?
- How do these questions relate to the different data to be validated? Source, Extract, EDW Raw, EDW Standard, EDW Presentation, Universe etc.?
- How do these questions relate to the relevant aspects of Data Quality?
- What are the processes, roles and responsibilities for on-going data validation post go-live?
- What is the communication channel for the business users to raise data questions and have them addressed in a timely manner?
Those questions above are only a small overview of what could go into a questionnaire like the DQPT; there are innumerable possibilities, and they need to relate specifically to the Data Quality program that exists within an enterprise. The questionnaire needs to be comprehensive enough to cover all the necessary elements, but not so technically detailed as to cause “pushback” from upper management or business-end stakeholders who do not see the need to give such involved answers – balance is necessary. The timing of the questionnaire is also important; such a tool should not be used so early in the SDLC that many of the questions are not yet answerable, nor so late in the process that changes cannot be made – somewhere near the end of the project’s Analysis Phase is probably best.
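To make the idea of capturing these answers in a single, reviewable document more concrete, the following minimal sketch shows one way a DQPT-style plan could be represented as a simple structure that travels with the project plan. The field names and example values are hypothetical assumptions for illustration; they are not taken from Allstate's actual questionnaire.

```python
from dataclasses import dataclass, field

# A hypothetical, simplified structure for capturing DQPT-style answers.
# Field names and example values are illustrative assumptions, not the
# actual layout of Allstate's questionnaire.
@dataclass
class DataQualityPlan:
    project: str
    data_involved: list[str]
    data_flow: str                       # e.g., "Source -> Extract -> EDW Raw -> EDW Presentation"
    primary_users: list[str]
    relevant_aspects: list[str]          # e.g., consistency, integrity, validity, duplication
    quality_tolerances: dict[str, str]   # aspect -> tolerated level of DQ problems
    assessment_techniques: list[str]     # e.g., profiling, ETL run logs, spot checks
    phase_tasks: dict[str, list[str]] = field(default_factory=dict)  # SDLC phase -> DQ tasks
    post_go_live_owner: str = ""         # role responsible for on-going validation
    escalation_channel: str = ""         # how business users raise data questions

# Example usage with made-up values.
plan = DataQualityPlan(
    project="Claims reporting enhancement",
    data_involved=["claim", "policy", "activity_event"],
    data_flow="Source -> Extract -> EDW Raw -> EDW Standard -> EDW Presentation",
    primary_users=["Claims analysts"],
    relevant_aspects=["validity", "integrity", "duplication"],
    quality_tolerances={"validity": "99% of codes resolve to reference data"},
    assessment_techniques=["data profiling", "ETL run logs", "manual spot checks"],
    phase_tasks={"Analysis": ["profile source nulls"], "Test": ["source-to-target compare"]},
    post_go_live_owner="DQ SME",
    escalation_channel="Data Quality mailbox",
)
print(plan.project, plan.assessment_techniques)
```

Capturing the answers in a structured form such as this makes it easier to revise them over the life of the program and to perform the post-project Data Quality assessment described above.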
Conclusion – SDLC from a Data Quality Perspective
Refining the SDLC from a Data Quality point of view is not necessarily a simple task; it involves buy-in from various levels of a given project, along with changes in budgets, scheduling and other elements that some stakeholders might not be too enthusiastic about. But such changes yield significant, ongoing benefits: a substantial improvement in a project’s data processes, less dependence on data inspections to achieve quality, and the eventual elimination of the need for such inspections by building Data Quality into projects in the first place. The anticipated benefits also include across-the-board improvements in the business decisions being made, in the Data Quality within the data warehouse, and in the reputation and productivity of virtually all IT areas, especially the Data Quality project team’s output. The improvements within the DQ team also include more trust throughout the enterprise concerning Data Quality, which helps to prevent last-minute crises and drama. An effective Data Quality program should ultimately be invisible to most people in an enterprise, yet it should be working in the background as an integral program, incorporated into project plans, training and the entire SDLC. Data Quality should never be an issue of leaky pipes, peeling paint and cracked walls that will ultimately be blamed on the IT branch of an enterprise. Good quality data is an asset to all organizations no matter their size, so Data Quality should always have a prominent position within any SDLC.