Data administration has in the past placed great emphasis on the logical level of data management. This has meant that data models, business processes, architectures, and application functionality have received a great deal more attention than physical data values in production databases. Yet, master data management (MDM) is fundamentally concerned with managing physical data values in production databases. These management requirements cannot be fully answered by abstracting the problems of MDM back to the logical layer of data models, architectures, and so on.
Thus, MDM spans the boundary of the logical and physical worlds of data. Legacy data administration tried to maintain these worlds as separate, with a general attitude of “we design the databases, but whatever the users put in them is their concern”. The result has been that nearly all the advances made by data administration in the past few decades have been confined to the logical level, with very little being done to address needs at the physical level. For instance, data model patterns and levels of abstraction, such as the “Party” pattern, have been the subject of much attention, but there is no detailed methodology for mapping production data assets (a topic known as data cartography).
For MDM this is a very serious problem. Attempts to implement MDM using the approaches developed by legacy data administration will never adequately address the requirements of physical master data. Therefore, it is necessary to understand what the management requirements of physical master data are in order to meet the MDM challenge. This article considers examples in one particular area – the governance of reference data values – where physical management needs often go unmet.
Unfortunately, reference data means different things to different people. The definition used here is:
Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.
This definition means that in practice reference data tables are what is commonly called “code tables”, or “lookup tables” or “domain values”. These are typically tables with a key column containing a code, and another column containing a description. Sometimes reference data tables can have additional columns, which may include foreign keys from other reference data tables. Examples of reference data tables include Country, Currency, Customer Type, and Product Category.
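The shape described above can be made concrete with a small sketch. The schema below is illustrative, not prescriptive: a hypothetical Country code table with a code column, a description column, and basic audit metadata (who changed a code and when), using Python's built-in sqlite3 module.

```python
import sqlite3

# Illustrative reference data ("code") table: a code, a description,
# and simple audit metadata. Column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE country (
        country_code TEXT PRIMARY KEY,   -- e.g. an ISO 3166-1 alpha-2 code
        description  TEXT NOT NULL,
        updated_by   TEXT,
        updated_at   TEXT
    )
""")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?, ?)",
    [("US", "United States", "jsmith", "2024-01-15"),
     ("DE", "Germany",       "jsmith", "2024-01-15")],
)
for row in conn.execute("SELECT country_code, description FROM country ORDER BY 1"):
    print(row)
```

The audit columns matter: as discussed below, knowing who added or updated a code, and when, is precisely the metadata that disappears when no parent table exists.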
Overall, little attention is paid to the population of reference data values. However, reference data tables typically make up anywhere from 20% to 50% of the tables in a database.
There is typically very little governance around data production in these tables. Thus, a database whose design is in third normal form can still have up to half of its tables populated in a very chaotic manner. In other words, excellence at the logical level does not guarantee very much for the physical values of this class of master data. Simply expecting users to manage the physical values well, without even providing any guidelines on how to do this – let alone implementing any governance processes – is an escape from the responsibilities of MDM. Let us examine one of the management needs around physical data values in this area.
Definition of All Required Reference Data Tables
It is not uncommon to find database tables such as Product or Customer that contain code columns which should be foreign keys, but for which no parent reference data table exists. The code values, therefore, exist only in the database table where they are found. If a new code has to be added, it can only be added as part of a new record. If a list of the existing set of codes is required, this can only be done by extracting them from the table where they are found. Since no parent table exists, no descriptions are available for the codes, and no metadata is available for them, e.g. who added or updated the codes and when this was done.
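The extraction step just described can be sketched as follows. The schema is assumed for illustration: a Customer table carries a customer_type code with no parent table, so the only way to recover the code list and bootstrap a parent table is to profile the distinct values, and even then no descriptions or audit metadata come with them.

```python
import sqlite3

# Assumed schema: customer_type is an orphan code column with no parent table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_type TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "RET"), (2, "WHL"), (3, "RET"), (4, "GOV")])

# The only available code list is whatever happens to be in the data.
codes = [r[0] for r in conn.execute(
    "SELECT DISTINCT customer_type FROM customer ORDER BY 1")]
print(codes)  # ['GOV', 'RET', 'WHL'] -- no descriptions, no metadata

# Seed a parent table from the extracted codes; descriptions and audit
# information must still be supplied by the business afterwards.
conn.execute(
    "CREATE TABLE customer_type (customer_type TEXT PRIMARY KEY, description TEXT)")
conn.executemany("INSERT INTO customer_type (customer_type) VALUES (?)",
                 [(c,) for c in codes])
```

Note that the bootstrap only recovers codes that currently appear in a record; a valid code that happens to have no rows at the moment of extraction is silently lost.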
This problem can arise in various ways, some of them rather subtle. For instance, an indicator can evolve from a “Y” or “N” to a range of code values.
The Governance Challenge
Once such a problem is detected, what should be done and who is responsible for doing it? Firstly, it is beneficial to detect such a problem as quickly as possible. As time goes by, program logic, screens, reports, interface files, and so on adapt to the incorrect design and the cost, difficulty, and risk of changing it become greater.
The best thing to do next is to correct the problem. This requires some kind of authority, and shows why MDM really needs strong governance to be successful. The problem is that governance for MDM cannot be implemented “bottom-up”. Data administration cannot merely suggest that parent reference data tables be implemented every time an orphan code column is detected.
Senior management often agrees with the overall idea of data governance for master data, without a clear understanding of what exactly is involved. Data administration usually cannot provide detailed descriptions either, because of its past focus on the logical level. Indeed, I have spoken to data administrators who are essentially waiting for senior management to tell them how to do governance. This is not going to happen. It is the responsibility of data administration to propose precise processes to deal with specific MDM governance issues. The need to create parent code tables is one such precise issue.
Data administration therefore needs to develop a process for how to deal with each individual governance requirement of reference data, and indeed master data as a whole. There is no single abstract or generalized “governance process” that can be put in place and expected to work for everything.
Governance for MDM is Different
As noted above, the first requirement for fixing the problem of orphan codes is to detect it. Realistically this can only be done by data profiling on a regular basis across the domain of production data. Yet, this kind of monitoring task is rarely carried out by data administration. Data profiling is typically done today in a project setting as part of source data analysis, e.g. for building a data warehouse. An orientation to projects rather than continuous MDM governance simply lets MDM problems accumulate, making them much more difficult to solve. A project that does source data analysis in the context of a major requirement like building a data warehouse is ill-equipped to solve any governance problem in the production of source reference data.
For this reason, and others, continuous surveillance of the production data landscape is required to detect the problem of orphan reference data code columns appearing in tables. This means that data administration needs to have its own well-designed business processes running on a continuous basis. This, in turn, is incompatible with a model of data administration being a competency center that supports projects which begin and end over time.
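One way such continuous surveillance might look in practice is sketched below. This is a heuristic, not a complete profiler: it flags text columns whose distinct value count is low (code-like) but which are not covered by any declared foreign key. The threshold and the demonstration table are assumptions for illustration.

```python
import sqlite3

def find_orphan_code_columns(conn, max_distinct=50):
    """Heuristic profiling check: text columns with few distinct values
    and no declared foreign key are candidate orphan code columns."""
    suspects = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Columns already covered by a foreign key are not orphans.
        fk_cols = {r[3] for r in conn.execute(f"PRAGMA foreign_key_list({table})")}
        for cid, name, ctype, *_ in conn.execute(f"PRAGMA table_info({table})"):
            if name in fk_cols or ctype.upper() != "TEXT":
                continue
            total, distinct = conn.execute(
                f"SELECT COUNT(*), COUNT(DISTINCT {name}) FROM {table}").fetchone()
            if total and distinct <= max_distinct and distinct < total:
                suspects.append((table, name, distinct))
    return suspects

# Demonstration on an assumed schema with one orphan code column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_category TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "HW"), (2, "SW"), (3, "HW")])
print(find_orphan_code_columns(conn))  # [('product', 'product_category', 2)]
```

Run as a scheduled job across production databases rather than inside a project, a check like this surfaces orphan code columns while they are still cheap to fix.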
Designing the business processes and implementing a supporting infrastructure for MDM governance is itself a form of governance. That is, the processes to design the processes of operational governance are very important. In the case of reference data they should begin by gaining an understanding of the physical data challenges. Besides orphan code tables these may include:
- Semantic coherence: Reference data tables are often little more than buckets holding totally unrelated codes.
- Resolution of overlapping reference data values: Reference data tables often contain code values that overlap, or are identical.
- Elimination of gaps in reference data tables: As a business changes there may be a need to introduce more codes to a table in a controlled fashion.
- Elimination of different levels of reference data: Some codes represent different levels of granularity, like values for “Sector” and “Subsector” in one table.
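The overlap problem in the list above lends itself to a simple automated check. The sketch below, over two assumed reference tables, flags descriptions that appear in more than one table, which is one symptom of overlapping reference data values.

```python
import sqlite3
from collections import defaultdict

# Assumed schemas: two reference tables that both contain a "Government" code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_type (code TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE party_type    (code TEXT PRIMARY KEY, description TEXT);
    INSERT INTO customer_type VALUES ('RET', 'Retail'), ('GOV', 'Government');
    INSERT INTO party_type    VALUES ('P1', 'Person'), ('G1', 'Government');
""")

# Index every description by the tables and codes that carry it.
seen = defaultdict(list)
for table in ("customer_type", "party_type"):
    for code, desc in conn.execute(f"SELECT code, description FROM {table}"):
        seen[desc.lower()].append((table, code))

# A description carried by more than one table is a candidate overlap.
overlaps = {d: locs for d, locs in seen.items() if len(locs) > 1}
print(overlaps)  # {'government': [('customer_type', 'GOV'), ('party_type', 'G1')]}
```

Matching on exact (case-folded) descriptions is deliberately crude; resolving near-duplicates such as "Govt." versus "Government" needs human review, which is exactly where a governance process comes in.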
Beyond this there are many other governance challenges for reference data. These include such diverse requirements as managing the semantics of individual codes, and managing subscriptions to standards bodies. However, it is the management of physical data values that poses the greatest challenge for MDM today.
ABOUT THE AUTHOR
Malcolm Chisholm, Ph.D., has over 25 years of experience in enterprise information management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management and business rules. Malcolm has authored two books: Managing Reference Data in Enterprise Databases (Morgan Kaufmann, 2000) and How to Build a Business Rules Engine (Morgan Kaufmann, 2003).