Getting data governance "done" in a rigorous manner can start with integrating new processes and deliverables into your project methodology. The basic items that need to be collected (and are usually NOT collected) are:
- Business definitions: we're not talking about the usual one-line, largely useless "definitions" that are really just an expansion of a physical column name. What we need here are robust definitions that fully describe the business concept, why the data element is important to the business, and why that data is captured (and by whom, if appropriate). If you capture these definitions and add them to an enterprise metadata repository, they become reusable and a great source of information when you are trying to understand what you're looking at in a column or on a report, or simply how the company does business.
- Derivations: one of the reasons that the numbers never seem to balance between departments, or between reports that purport to calculate the same result, is that the details of the calculation differ. Not only might the formulas not match, but the inclusion/exclusion rules might be different as well. For example, the included records might cover different timeframes. I remember one case where two reports used exactly the same calculation, but one used a particular status date as the end point and the other used the next day. As with definitions, capturing derivations gives the enterprise a way to reach agreement and a single documented derivation to refer to.
- Data quality rules: it's pretty hard to decide whether data is of good quality (however you define that) if you don't have a rule for determining that very fact. The rule might be as simple as "this should never be negative", or it might be a set of complex rules that tie the instance of a record in one table (Account) to the existence of a record in another table (Borrower). When projects use or change data, the data quality rules for any data elements with a perceived issue need to be recorded. By doing so, you have the basis to apply data profiling and actually measure the quality of the data. Of course, once you DO look at the data, you may need to adjust the rule to allow for unforeseen data that is present but does not cause a problem. These "oh yeahs" can be something like a set of special values ("unknown", 999-99-9999) that have meaning and are allowed to be in the data without causing a business problem.
These are basic items that need to be captured and documented as part of a project methodology. There may be more, especially if data profiling is considered to be part of a project, in which case the results of the profiling could be considered a project deliverable.
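To make the data quality rules above more concrete, here is a minimal sketch of how recorded rules might be applied during profiling. Everything in it is an assumption for illustration: the field names (`balance`, `borrower_id`, `ssn`), the sample records, and the helper names are hypothetical, and a real profiling tool would look quite different. It shows a simple rule ("never negative"), a cross-table rule (an account must have a borrower), and a format rule that was adjusted to let a meaningful special value through.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical special value that has meaning and is allowed in the data.
ALLOWED_SPECIAL_SSNS = {"999-99-9999"}


@dataclass(frozen=True)
class QualityRule:
    """A named rule; check() returns True when a record passes."""
    name: str
    check: Callable[[dict], bool]


def balance_not_negative(account: dict) -> bool:
    # Simple rule: "this should never be negative".
    return account["balance"] >= 0


def make_borrower_exists(borrower_ids: set) -> Callable[[dict], bool]:
    # Cross-table rule: every account must point at an existing borrower.
    def check(account: dict) -> bool:
        return account["borrower_id"] in borrower_ids
    return check


def ssn_well_formed(account: dict) -> bool:
    # Format rule, adjusted after profiling to accept known special values.
    ssn = account["ssn"]
    if ssn in ALLOWED_SPECIAL_SSNS:
        return True
    parts = ssn.split("-")
    return [len(p) for p in parts] == [3, 2, 4] and all(p.isdigit() for p in parts)


def profile(records: list, rules: list) -> dict:
    """Count how many records fail each rule."""
    return {
        rule.name: sum(1 for record in records if not rule.check(record))
        for rule in rules
    }


# Made-up sample data, purely for illustration.
borrowers = {"B1", "B2"}
accounts = [
    {"id": "A1", "balance": 100.0, "borrower_id": "B1", "ssn": "123-45-6789"},
    {"id": "A2", "balance": -5.0, "borrower_id": "B2", "ssn": "999-99-9999"},
    {"id": "A3", "balance": 0.0, "borrower_id": "B9", "ssn": "12-345-6789"},
]
rules = [
    QualityRule("balance_not_negative", balance_not_negative),
    QualityRule("borrower_exists", make_borrower_exists(borrowers)),
    QualityRule("ssn_well_formed", ssn_well_formed),
]
failures = profile(accounts, rules)
print(failures)
```

Because each rule is recorded by name, the failure counts can be published alongside the rule itself, which is what makes the profiling results reusable rather than a one-off.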
Once you've collected these items, you need to document them. You can use Word documents or a spreadsheet during the project, but the results should be published into a metadata repository (or its equivalent at your company) as soon as the items are finalized. You'll need to provide very specific, field-by-field instructions on how to fill out the project documents, as the items being recorded may not be very familiar to the project personnel. Ideally (as discussed in the next post), one or more people on the project should have specific roles to collect and document this information. The people fulfilling these roles will need to be trained, and perhaps even supplied, by the Data Governance organization, although the project will need to fund their efforts.
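As a rough illustration of what a field-by-field project document might capture, the sketch below models one data element as a record and flags the fields that are still empty. The class name, field names, and sample values are all hypothetical; your metadata repository will have its own structure, so treat this as one possible shape rather than a template to adopt.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DataElementRecord:
    """One row of a hypothetical project metadata document."""
    name: str
    business_definition: str
    captured_by: str = ""
    derivation: str = ""
    quality_rules: list = field(default_factory=list)


def missing_fields(record: DataElementRecord) -> list:
    """Return the names of fields that have not been filled in yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example: a definition has been written, but nothing else yet.
record = DataElementRecord(
    name="loan_status_date",
    business_definition="Date on which the loan last changed status.",
)
print(missing_fields(record))
```

A check like this could be run before publishing to the repository, so that incomplete entries are caught while the project team is still available to fill them in.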
Funding can be a bit dicey, as the value provided by collecting this information may not be immediately obvious to the project, and so there may be some pushback to including and funding these roles. But the value is certainly present, and that needs to be made clear to the Project Manager. The value includes:
- Making sure that there are no misunderstandings of what the data is and how it is used and calculated on the project. These sorts of misunderstandings can lead to tail-chasing discussions and considerable rework.
- Ensuring that code doesn't have to get reworked because of misunderstood data. There are usually two outcomes to not understanding what constitutes quality data. The first is that the code simply doesn't run correctly, and has to be revised again and again as more is revealed about the data. The second is that the programmers profile the data themselves to avoid said revisions (they don't trust the data), but the results are not reusable because they are not documented properly. Also, it is usually vastly more expensive for individual programmers to examine the data on their own than to have a proactive effort to profile the data based on the data quality rules. Plus, having the programmers do it ad hoc throws off the project timeline, whereas planning to do it as part of the project means that the time and resources needed are built into the project timeline.
- Providing reusable intelligence for future projects. Collecting this information is a "do it once" task; from then on, any project that uses this same data can simply look up the results.
- Building better overall knowledge of the enterprise's data asset. This is a huge benefit to the enterprise as a whole, especially when it comes to making intelligent business decisions and understanding what the customer wants. This value proposition only carries weight, however, when project management operates at an enterprise level; siloed management tends to consider it a non-starter because it doesn't directly impact their project.
So, now that we understand the value of doing this work to a project (and to the enterprise), the next post on this topic will explain the "nuts and bolts" of getting it done. Stay tuned...