Primary keys are a major component of formal data resource design. However, the designation and use of primary keys is often informal based on physical implementation. The result is an increase in data disparity and difficulty navigating an organization’s data resource.
A recent online discussion posed the question about whether there could be an alternate for a surrogate key. The discussion was interesting, because surrogate means substitute. If a surrogate key is a substitute, then there must be a more important key for which the surrogate key is a substitute. Having an alternate for a surrogate key means that it would be an alternate to a substitute for a more important key. The question becomes What is that more important key?
The more important key is, of course, a primary key meaningful to the business. Logically, a primary key is one or more data characteristics whose value(s) uniquely identify each data occurrence in a data subject. Physically, a primary key is one or more data items whose value(s) uniquely identify each data record in a data file. In a relational database, a primary key is one or more columns whose data value(s) uniquely identify a row in a table.
A logical primary key meaningful to the business is critical for properly normalizing the data. A physical primary key and its corresponding foreign key are critical for properly navigating a database to store and retrieve data. What is happening in many situations is that the brute-force-physical development of databases leads to the designation of physical surrogate keys without the proper designation of a logical primary key. Avoiding formal logical design before physical design avoids designating the proper primary key and formal data normalization. The result is extremely poor data resource design, difficult database navigation, and an increase in data disparity.
For example, a data file could be defined for rivers, vegetation, and vehicles with a system counter as a surrogate key. Any normalization performed on the data would indicate proper normalization. However, a business professional would look at the data and say that normalization failed because rivers, vegetation, and vehicles have their own primary keys meaningful to the business, showing they do not belong in the same data file. Specifically, rivers are uniquely designated by the USGS River Reach Number, vegetation is uniquely identified by the scientific name, and vehicles are uniquely identified by the manufacturers Vehicle Identification Number.
The problem continues. Another data file containing vehicles and buildings is designed with a its surrogate key. Yet another data file containing buildings and roads is designed with its surrogate key. The practice continues throughout the organization resulting is huge quantities of disparate data. The problem is referred to as the primary key fiasco where databases are built with physical surrogate keys without the benefit of proper normalization using business meaningful primary keys.
The primary key fiasco is resolved with the implementation of seven different classifications for primary keys.
1. Primary key composition indicates the number and nature of the data characteristics forming a primary key. Primary key composition can be simple, compound, or complex. A simple primary key contains one home data characteristic in its home data subject. A compound primary key contains multiple home data characteristics in their home data subject. A complex primary key contains multiple data characteristics from both the home data subject and one or more foreign data subjects.
2. Primary key temporality indicates whether or not a primary key contains temporal components. A temporal component consists of one or more temporal data characteristics. A temporal data characteristic is any data characteristic that represents a component of chronology, such as seconds, minutes, hours, months, years, decades, centuries, geologic epochs, and so on. A non-temporal primary key is a compound or composite primary key that has no temporal component. Current data do not necessarily need a temporal component in the primary key. A temporal primary key is a primary key is a compound or composite primary key that has a temporal component. Historical data must have a temporal primary key, and the granularity of the temporal data characteristics depends on the temporal relevance for an organization.
3. Primary key meaning indicates whether or not the primary key is meaningful or meaningless to the business. A meaningful primary key is a primary key that is meaningful to the business. A meaningless primary key is a primary key that has no meaning to the business. The terms meaningful and meaningless are used, rather than intelligent and non-intelligent, because primary keys cannot possess intelligence. Inanimate objects, like chairs, hammers, teddy bears, and so on, cannot possess intelligence. Concepts, like bank accounts, cannot possess intelligence. Primary keys are a concept and cannot possess intelligence. Using the term intelligent key contributes to the lexical challenge in data resource management and must be avoided.
4. Primary key origin indicates whether the primary key is inherent to the data occurrences or was assigned within the organization and is not inherent to the data occurrences. A natural primary key is a primary key that is an inherent feature of the data occurrences. It is usually assigned outside the organization and is inherited by the organization. A natural primary key is usually, though not always, a meaningful primary key. An artificial primary key is a primary key that is arbitrarily assigned to the data occurrences by the organization to support their management of the data. An artificial primary key is usually, though not always, a meaningless primary key.
5. Primary key purpose indicates how the primary key is used within the organization. A logical primary key is a primary key that uniquely identifies each data occurrence in a data subject and is used to properly normalize the data during logical data design. It must be meaningful to the business, and is usually natural. A physical primary key is a primary key that is useful for unique identification of data records in a data file and can be used for navigating between data files in a database. It may or may not be meaningful to the business, and can be natural or artificial.
6. Primary key scope indicates the range of data occurrences covered by the primary key. Even though the primary key is supposed to uniquely identify each data occurrence in a data subject, situations do arise where the primary key does not uniquely identify each data occurrence. A general primary key is a primary key that uniquely identifies every data occurrence in a data subject. It is the desirable primary key for a data subject. A limited primary key is a primary key that is available for all data occurrences, but is limited in scope. A specific primary key is a primary key that is not available for all data occurrences in a data subject. It’s only available for a subset of the data occurrences.
7. Primary key status indicates the current state of a primary key as it moves through a development cycle. A candidate primary key is a primary key that has been identified and considered as a primary key, but has not been verified. A preferred primary key is a primary key that has been designated as the preferred or predominant primary key for the data subject. Usually one primary key is designated as preferred, and the others are designated as alternate primary keys. An alternate primary key is a primary key that is valid and acceptable to use, but is not the preferred primary key. An obsolete primary key is a primary key that has no further use and should not be used. It is documented so that people recognize that it’s obsolete and do not perpetuate it’s use.
Foreign keys follow the same classification as their corresponding primary key. However, a foreign key is not typically defined for every primary key that is defined. Logical foreign keys are defined based on the corresponding logical primary key used for data normalization. Physical foreign keys are defined based on the corresponding primary key used for data navigation. Using these seven classifications results in proper data normalization based on logical primary keys and proper data navigation based on physical primary keys.
Data modelers and database managers continue to ask if they really need to do all this stuff. Is all this stuff really necessary? Why can’t databases just be built to support current business needs? Well, it’s not stuff and it needs to be done to formally design and implement a data resource that is not disparate and fully supports an organization’s current and future business activities.
Data management professionals must make every effort to properly design and implement an organization’s data resource. They must stop ignoring formal logical design using data normalization based on meaningful business keys. They must stop brute-force-physical development of databases based on surrogate keys. They must begin identifying and documenting all primary keys based on a formal primary key classification scheme. That is their professional responsibility.