You are here:  Home  >  Data Blogs | Information From Enterprise Leaders  >  Current Article

Metadata: Keeping track of lists of values

By   /  May 28, 2011  /  2 Comments

by David Plotkin

As you build metadata expertise, document definitions in your metadata repository (or whatever tool you use) and gain credibility in the enterprise, you are likely to find that your effort includes gathering, stewarding, documenting, and providing lists of valid values (sometimes called “enumeration lists”). After all, having standardization for your name suffixes, gender, marital status, relationship type, and so on ensures that new systems (and existing systems that can handle the modification) will use the same values. This leads to standardization on reports and in the data warehouse as well.

I think most would agree that items like gender code should be tracked by Data Governance (or whatever you call it) and even consider the Metadata Repository the “system of record” for such value lists. These lists have wide usage, a short list of values that don’t (or shouldn’t) change often, and no real clear owner or system of record.

But how do you draw the line between “true” valid lists of values and the vast number of data elements that happen to have a finite list of values, but should be neither governed by Data Governance nor should be documented in the Metadata Repository? Items that potentially fall into this category are things like GL Account codes, office location codes, sale rep identifiers, and even Employee Ids. And trust me, you do need to draw that line, as analysts and project team members will start asking Data Governance for this information once they understand what we do. After all, its easier than trying to dig this stuff out for themselves! The key differentiator seems to be that data elements that are created as part of a common business process with a clear business function owner should NOT be part of the Data Governance deliverables. All of these examples fall into that category. For example, HR creates (and terminates) Employee Ids as part of the hiring and termination process. They are rightfully in control of that process, and the value set changes daily (and even continuously). No one with any sense would suggest that this constantly fluctuating value set belongs in the Metadata repository, or that the Metadata repository should be the system of record. There is a clearly defined system of record — the HR system which uses these values to do the processing required for employees — such as establishing their managerial structure, setting their service date, getting them paid, tracking their taxes and withholding, disciplinary actions, change of status, location, and so on. The same can be said of the other examples noted.

Note that this doesn’t imply that the maintenance/add/removal of the values is limited to the system of record or is simple to administer. Adding a Sales Rep Id, for example, involves not only adding it to the Sales system (probably the system of record), but the HR/Compensation system, establishing the location (which can change from day to day or even hour to hour), and so on. They key, as I said, is that a common business practice with a business function owner owns this process, and the system of record is highly likely to be the main system in which the value set is adjusted initially (with propagation as necessary) and which cannot function properly without having the most up-to-date list of these values.

A key point here is that many times, true “valid values” (such as gender code) don’t have a well-defined system of record. You might make a case that Gender Code is “people data” and thus owned by the HR business function (and so the system of record should be the HR system). But what about all the people the Enterprise deals with who are not employees or contractors? Customers, suppliers, external agents, etc. “Solving” this by putting the data element into a generalized function like Customer Master (with a domain data steward) establishes ownership but does nothing for solving the issue of the SOR for these data elements. Most of the time, these values are used so generally across the enterprise that it is prudent to have an agreed-upon list documented in an easy-to-find place. The list of potential values is so small that it is reasonable and convenient to record and maintain the list in the Metadata repository, though it must be implemented identically (good luck with that) in every system which contains the data element.

To tell the truth, this whole discussion came up because of Product, and whether there should be a list of product codes supplied by Data Governance and kept for reference in the Metadata Repository. I have to admit, my initial inclination was to specify the codes and keep them in the repository, though not as the system of record. After all, we only have about 25 products, and we don’t add a lot of new ones very fast, since it takes a major effort to do that. And to reiterate, the system of record(s) has to be the product systems themselves because that is where you need to fully define the product in order for the system to work correctly (and enable you to sell the product). The fact that a bunch of other systems have to get major updates as well is more a failing of the integration and system design than anything else.

I have since changed my mind about even keeping product in the Metadata repository. To see the apparent insanity of recording the list of products in the Metadata Repository, generalize to take the example of a major retailer, such as Longs Drugs, where I worked for 7 years. Longs (now CVS) has a well-defined product hierarchy, each level is clearly specified and ALL products must fit into the hiearchy and populate under values all the way to the top. The hierarchy is a pyramid (as pretty much all  hierarchies are), with just a few values at level 2, 3, and 4. However, by the time you get down to the next-to-last level (SKU), the values have ballooned to well over 100,000; and at the bottom level (UPC), the list of values are numbered in the millions. This is because every single item that Longs sold had a separate product identifier (UPC) which differentiates by package size (8 oz. vs. 12 oz. of Gelatin), flavor (grape or cherry), brand (Jello or Royal), and even type of packaging (single boxes versus six-packs/bundles). And more.

Given all, this, it is clear that the Metadata repositoryis NOT the system of record for any part of the product hierarchy (that also has not changed from my initial dialog). But does it make sense to record the products in the repository? I would now say that it does not.  Basically, the list will be out of date almost immediately, because the business processes to keep it updated doesn’t exist — and really doesn’t need to.

However, just to be clear, I do think that the definition (the levels, what they are called, what they mean) of product hierarchy ITSELF (and probably any other hieararchy) SHOULD be documented in the Metadata repository.  If there is no agreement on the hierarchy, then reports based on the disparate hierarchies will not match. In addition, it is a very bad idea to have different versions/definitions of what is meant by “product” (the bottom level of the hierarchy). If one group defines product as (for example) an auto insurance policy, while another breaks it down to a specific set of coverages in an auto policy, then not only will reports not work properly, but the very systems that enable working with those products will not work the same and will require people using the systems to know the idiosyncracies of something as basic as what the product is. While little can be done when working with legacy systems, new systems should be designed with a common, governed, list of products and a common hierarchy.

So, that’s it! As always, comments are much welcomed, especially from those of you who have fought this fight before.

About the author

David Plotkin is an Advisory Consultant for EMC, helping clients implement or mature Data Governance programs in their organizations. He has previously served in the capacity of Manager of Data Governance for the AAA of Northern Ca, Nevada, and Utah; Manager of Data Quality for a large bank; and Data Administration Manager at a drug store chain. He has been working with data modeling, data governance, metadata and data quality for over 20 years. He serves as a subject matter expert on many topics around metadata, data governance, and data quality, and speaks often at industry conferences.

  • John Biderman

    This is a thought-provoking post. I’d like to offer a couple nuanced different points of view. I think unique IDs, like an Employee ID, assigned in a system of record indeed would not be enumerated in your metadata documentation (and in fact may have security issues that would preclude it from being exposed). But there are many cases when such IDs, I think, do need governance. For example, what about when so-called sales rep IDs actually represent a book of business that gets transferred from individual to individual, and the actual record of the person who made sales at a point in time is thus lost? I think some governance, and enterprise understanding, of what a “sales rep ID” actually means is required. “Product” is so loaded with ambiguity that there may be several types of “product codes” in an enterprise at different levels of the product hierarchy (as you describe), and understanding what they all mean, and under what circumstances a new one should be generated — i.e. when a change may get introduced in the product taxonomy — does require governance, in my opinion.

    But I agree that the metadata repository is not a system of record for most of these things. (A reference data hub – virtual or physical – is more appropriate for this purpose.) However, keeping the enumerations documented in the repository as they change is a difficult task. We did a neat little trick for our data warehouse users: we created a little Java application outside the repository that can be invoked while you are browsing the metadata for a reference code set. The app queries the associated data warehouse reference table in real time and spawns a little browser window in which you can scroll and search the code list. In other words, the documentation of the list of values is rendered dynamically and is self-maintaining. Cool, huh?

  • David Plotkin

    Hi John,
    Thanks for your thoughts, I always appreciate your input. Sorry it took me so long to get back to you, but apparently Dataversity doesn’t email the author when someone leaves a comment, so I stumbled onto it quite by accident.

    I only have one problem with your comment — I can’t see how we disagree! I completely agree (and said as much, if indirectly) that even when the individual values themselves don’t fall into the bailiwick of Data Governance, these values are generated according to RULES, and rules require governance. For example, new products are generated according to rules, including what the hierarchy looks like and where the product and its hierarchy levels fit in. There were many cases at AAA where Data Governance had to step in as products were being defined that did not fit the stated rules, and these occurances causes all kinds of havoc because systems and data warehouses were designed assuming that the rules would be followed.

    We too have experimented with (and implemented) dynamically documented lists of values. In our case, there was a standardized set of tables that mapped the source system values to “harmonized” values used in any integrated data layer. These tables didn’t live within the repository, but could be called and viewed from there. They were also accessible to the ETL jobs making the transformations and enforcing the validity of the values during the warehouse load. Finally, these tables could be used by a GUI front-end that enabled the owning steward to adjust the values, descriptions, and even add date effectivity to the values where needed. All of this integrated the setting and enforcing of values in the business where it belonged, and placed the responsibility for keeping things up to date with the business data stewards.

You might also like...

Eight Launches AI-Powered Personal Sleep Coach

Read More →