You are here:  Home  >  Data Blogs | Information From Enterprise Leaders  >  Current Article

Identity: Working with Surrogate Keys

By   /  April 14, 2017  /  2 Comments

Click to learn more about author Michael Blaha.

Identity is the means for denoting individual entities, so that they can be found. Identity is a prominent concern for databases because developers must have some way for referring to things. There are two aspects of identity – how to connect records together (internal identity) and how to obtain data in responding to queries (external identity).

Internal Identity

Internal identity deals with connections among records.  Primary keys identify records within tables and foreign keys bind records together. A primary key is a unique combination of one or more attributes that is used to preferentially access the records in a table. A foreign key is a reference to a primary key. There are two basic approaches to primary keys in a data model.

  • Surrogate keys. A system-generated attribute is added to each entity type table and made the primary key. This attribute is often an artificial number. The primary key for each relationship table consists of identifiers from the related entity types.

Surrogate keys have the advantage of being a single attribute, small, and uniform in size. Most relational database managers provide ID generators and allocate identifiers efficiently. Since the primary key is synthetic, it is immutable and there are no updates to foreign key references.

However, surrogate keys can make a database more difficult to read during debugging and maintenance. IDs also complicate database mergers; ID values may contend and need to be reassigned. Surrogate keys should only be used internally in applications and not be displayed to users. Unique combinations of application attributes can still (and should) be enforced — that is the purpose of alternate keys.

  • Natural keys. A unique combination of application attributes is used to identify each entity. The primary key for each relationship table consists of primary keys from the related entity types.

Natural keys have different trade-offs. Primary keys have intrinsic meaning, making it easier to debug the database. A downside is that the value of application attributes can change — such changes must propagate to foreign keys.  Some entity types lack unique application attributes. Some models have a series of dependent entity types that lead to unwieldy multi-attribute primary keys.

We recommend that you use surrogate keys for relational databases with more than 30 entity types. The resulting uniformity and simplicity outweighs any additional debugging effort. Both surrogate keys and natural keys are viable options for small applications. Data warehouses should always use surrogate keys.

External Identity

External identity is the ability to start from outside a database, specify the value of one or more attributes, and find data. Alternate keys are important because they are unique application values. Queries often start with alternate keys, natural keys, or sometimes non-unique values, and navigate the connections of internal identity to find desired data.

In SQL queries most joins traverse the connections of internal identity, binding foreign keys to their primary keys. In contrast where clauses specify external identity – the starting points of attributes for traversal – and other constraints.

In Conclusion

Pay close attention to identity as you construct data models. Normally, you should use a surrogate key as the primary key of each entity type table. It’s also important to define referential integrity so that your foreign key references really do exist. Otherwise you may have dangling references that cause you to lose data in queries. It’s also important to define alternate keys. Alternate keys provide entry points to a data model so that users can specify application values and obtain the data that they need.

About the author

Michael Blaha is a consultant and trainer who specializes in conceiving, architecting, modeling, designing and tuning databases. He has worked with dozens of organizations around the world. Blaha has authored seven U.S. patents, seven books many articles, and two video courses. His most recent publication is the Agile Data Warehouse Design video course from Infinite Skills. He received his doctorate from Washington University in St. Louis, and is an alumnus of GE Global Research in Schenectady, New York. You can find more information with his LinkedIn profile or at superdataguy.com.

  • Steve Firebaugh

    Thank you Michael Blaha for another interesting article. We faced these issues for years, and fought bugs based on duplicated or non-canonical natural keys. Today, for our database storage, the use of GUIDs in every table as a surrogate, primary key has eliminated those problems. Joins are always successful and always unambiguous. Whoever is responsible for GUIDs should get a Turing Award — they are a brilliant addition to our field.

    • Michael Blaha

      Thanks Steve for your comments. In the past I encountered folks who argued for natural keys for internal identity. I see less of that nowadays.

You might also like...

5 Pitfalls to Avoid from Real Stories of Analytics Projects (Part 2)

Read More →