To normalize or not to normalize, that is the question
July 17, 2008
Jeff Atwood published a very interesting post “Maybe Normalizing Isn’t Normal” where he delves into whether you should normalize or denormalize. Be sure to check the comments, or this summary on the High Scalability blog if the number of comments (and tone in some cases) gives you a headache.
The post is very interesting but I took issue with this:
Both solutions have their pros and cons. So let me put the question to you: which is better — a normalized database, or a denormalized database?
Trick question! The answer is that it doesn’t matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n. Even a modest PC by today’s standards — let’s say a dual-core box with 4 gigabytes of memory — will give you near-identical performance in either case for anything but the very largest of databases. Assuming your team can write reasonably well-tuned queries, of course.
While it is true that for small data sets there is no performance difference between a normalized and a denormalized schema, it makes a huge difference once your data set grows. Adding to the fun, changing your schema becomes harder and harder as the data set grows.
Then things settle down:
First, a reality check. It’s partially an act of hubris to imagine your app as the next Flickr, YouTube, or Twitter. As Ted Dziuba so aptly said, scalability is not your problem, getting people to give a shit is. So when it comes to database design, do measure performance, but try to err heavily on the side of sane, simple design. Pick whatever database schema you feel is easiest to understand and work with on a daily basis. It doesn’t have to be all or nothing as I’ve pictured above; you can partially denormalize where it makes sense to do so, and stay fully normalized in other areas where it doesn’t.
A sane, simple design is a “good thing”, but you also need to plan for the future: you want a sane, simple design that can evolve and scale.
Finally sanity is restored:
Pat Helland notes that people normalize because their professors told them to. I’m a bit more pragmatic; I think you should normalize when the data tells you to:
- Normalization makes sense to your team.
- Normalization provides better performance. (You’re automatically measuring all the queries that flow through your software, right?)
- Normalization prevents an onerous amount of duplication or avoids risk of synchronization problems that your problem domain or users are particularly sensitive to.
- Normalization allows you to write simpler queries and code.
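The parenthetical above is worth taking seriously: you can only “normalize when the data tells you to” if you are actually measuring your queries. A minimal sketch of that idea, assuming a DB-API-style driver whose `execute` function you can wrap (the driver stand-in here is hypothetical):

```python
import time

class TimedConnection:
    """Hypothetical wrapper that records the duration of every query
    issued through it, so schema decisions rest on real numbers."""

    def __init__(self, execute_fn):
        self._execute = execute_fn      # underlying driver's execute()
        self.timings = []               # list of (sql, seconds) pairs

    def execute(self, sql, params=()):
        start = time.perf_counter()
        result = self._execute(sql, params)
        # Record how long this query took alongside its SQL text.
        self.timings.append((sql, time.perf_counter() - start))
        return result

# Usage with any execute(sql, params) callable; a no-op stand-in here:
conn = TimedConnection(lambda sql, params: [])
conn.execute("SELECT id FROM users WHERE id = %s", (42,))
slowest = max(conn.timings, key=lambda t: t[1])
```

In practice you would feed these timings into your logging or metrics system rather than keep them in memory, but the point stands: instrument first, then decide.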
In my experience (with Feedster amongst others), a heavily denormalized schema is easy to work with but simply does not scale well.
With my current project I took a different tack:
- Normalize where it makes sense and group logical chunks of data together, even if it means having 1-to-1 relationships. From a performance point of view this means you fetch and update only the chunks you need, rather than accessing tables with 50+ fields where 90% of the fields are null (don’t laugh, I have seen it happen).
- Never ever ever join to get data; it is better to issue two simple queries than one join. The caveat is that this is born of experience with MySQL and large amounts of data (1/2 TB), where even with indices performance can be unpredictable.
- Sharding your data is pretty much the only way to scale, so design that in from the start.
- Build a data access layer which hides the schema from the application.
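The last three points fit together naturally: if a data access layer hides the schema, it can also hide the shard routing and the two-queries-instead-of-a-join pattern. A minimal sketch of that combination, where the shard count, table names, and dict-backed "shards" are all hypothetical stand-ins for real MySQL connections:

```python
NUM_SHARDS = 4

class UserDAL:
    """Hypothetical data access layer: the application asks for a user,
    and never sees the sharding scheme or the table layout."""

    def __init__(self, shards):
        self.shards = shards            # one store (connection) per shard

    def _shard_for(self, user_id):
        # Simple hash-based routing; a real system might use a lookup
        # service instead, so shards can be rebalanced later.
        return self.shards[user_id % NUM_SHARDS]

    def get_user_with_profile(self, user_id):
        shard = self._shard_for(user_id)
        # Two simple primary-key lookups against the 1-to-1
        # users/profiles tables, rather than one join.
        user = shard["users"].get(user_id)
        profile = shard["profiles"].get(user_id)
        return user, profile

# In-memory stand-ins for four shards, each with two "tables":
shards = [{"users": {}, "profiles": {}} for _ in range(NUM_SHARDS)]
dal = UserDAL(shards)
shards[2 % NUM_SHARDS]["users"][2] = {"name": "alice"}
shards[2 % NUM_SHARDS]["profiles"][2] = {"bio": "hi"}
user, profile = dal.get_user_with_profile(2)
```

Because the application only ever calls `get_user_with_profile`, you can later change the routing function, split the tables further, or denormalize a hot path without touching application code.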
I am sure there is more, but this is a start.