by Karen Lopez
At Enterprise Data World I gave a Lightning Talk, Karen's List of the Most Irritating Normalization Myths. This was a fast-paced, auto-advancing slides presentation with 10 slides covered in five minutes. I covered the types of things I've heard in design reviews that either didn't make sense or repeated some urban data legend. I had only thirty seconds per slide, so it was a lot to cover at an extremely rapid pace. The blogs here at Dataversity.net give me an opportunity to get my rants down in writing, something I may regret later. This sort of risk never stopped me before, though.
What most people know about normalization they learned from word of mouth, much like most of us learned about sex -- in the hallways of school, told as dirty little tales from school kid to school kid. You probably remember just what sorts of data quality that information had.
I first learned about normalization in a real database class in university, where we learned the theory, then developed solutions that worked with both normalized and highly-denormalized data structures in order to best understand the costs, benefits and risks associated with design decisions. I was shocked when I went into the "real" world and found out how many myths there are about what it is, how it works and how it is something to be avoided at all costs. Normally (excuse the pun) I would not care about misinformation that my team members spot, but when their misunderstanding of the foundation theory of database design starts to impact my work, I have to call BS. This blog series focuses on the myths and poorly worded complaints against normalization.
Then this week a prominent database technology expert whom I highly respect wrote a newsletter article about overnormalizing database designs. I know for certain he knows a great deal about database design. Most likely due to space constraints, he also perpetuated one of the myths I want to cover in this series. I still highly respect him. I've been known to make similar statements. I'd bet we all have. Did I mention I highly respect him?
Writing about normalization is error prone because professionals want to use precise terms. A relation isn't a table. A set is not a table. However, when I talk with team mates about normalization, it's sometimes easier and more clear to them to make those sorts of analogies because people can visualize a relation as a table. It's still wrong, but it is more clear. In this series, I will endeavor to be precise and still make analogies with real world artifacts such as tables and databases. I therefore ask the professional purists to grant me some poetic license to do so.
Most people I've asked tell me they learned about normalization in one of these situations:
- In a 1-5 day course normalization was covered right at the start of the course. It sort of made sense, but they didn't really remember what the normal forms are and would have to go back to their notes to figure out what each normal form is.
- In a book they read about the normal forms. They saw how a table was transformed from 1NF to 2NF, etc. There was a brief explanation why normalization was important.
- In a formal education course, where there were 1 or two modules on normalization. Students were required to show a data structure as it was normalized from 1NF to 5NF. There was an exam section on the normal forms where they did a very similar exercise. They remember doing it, but have never taken a data structure through all the normal forms again.
- In a meeting they heard someone complain about a design that was or was not properly normalized.
- In a bar, where a co-worker griped because the database design was horrible because it was over-normalized and therefore had many tables.
Almost all the people I talk with pretty much equate normalization with something evil, as if data architects and database designers conspire to wedge as much normalization as they can in design just to work against developers and DBAs as much as possible. Few (other than data professionals) understand that the main reason we are concerned about normalization in relational databases is to increase data integrity by reducing redundant data and mitigating update anomalies. Notice how nothing in that statement speaks to query performance. That's because normalization is about updating data - creating, updating and deleting data. You could, though, think of normalization as a method for increasing the performance of the data, not the code.
The Normal Forms
In this series I'm not going to cover normalization as a tutorial but I will share this description* from Wikipedia that covers the normal forms.
Third normal form (3NF)
E.F. Codd (1971); see +also Carlo Zaniolo's equivalent but differently-expressed definition (1982) Every non-prime attribute is non-transitively dependent on every candidate key in the table
Every non-trivial join dependency in the table is implied by the superkeys
I have modified the format of the table for compatibility with a variety of platforms and blog readers.
My first irritation, coming up in the next post, is about the love/hate relationship of normalization. Your assignment is to reach deep down into your heart and identify your true feelings about normalization. I'd love to hear about how you learned about normalization and how you came to your feelings about it. Who says normalization is only math?
Your second assignment is to Tweet @kendra_little that you loved her poo art. Thanks, Kendra.
* Wikipedia contributors, "Database normalization," Wikipedia, The Free Encyclopedia,
http://en.wikipedia.org/w/index.php?title=Database_normalization&oldid=433633646 (accessed June 14, 2011).