I have a basic CRUD web app where people can create and edit articles. I now want to add the ability to keep a revision history of all edits. Currently, I have an Articles table that looks like this:
Article(id, title, content, author_id, category_id, format)
I have considered two options for changing my current schema to add support for revision history. The basic idea is that every edit to any article is stored as a record in a Revision table, so Articles and Revisions have a one-to-many relationship.
1st option (normalized): One table for article metadata, one for revisions. No duplicate data stored.
Article(id, title, category_id)
Revision(id, article_id, content, author_id, format)
2nd option (denormalized): Two tables like option 1, but with some columns duplicated in Article.
Article(id, title, content, author_id, category_id, format)
Revision(id, article_id, content, author_id, format)
I'm thinking of going with the 2nd option because it will make my coding much easier (less complex, fewer lines of code). I know it isn't "academic" or "pure", but my personal feeling is that having to do extra joins would hurt code maintenance. Also, performance should be better, since fewer joins will have to be done.
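For concreteness, here is roughly what fetching an article for display looks like in each case (just a sketch - I'm assuming the highest Revision.id marks the latest revision, and :article_id is a bind-variable placeholder):

-- 1st option: join to find the current content
SELECT a.id, a.title, r.content, r.author_id, r.format
FROM   article a
JOIN   revision r ON r.article_id = a.id
WHERE  a.id = :article_id
AND    r.id = (SELECT MAX(id) FROM revision WHERE article_id = :article_id);

-- 2nd option: the current content is right on the article row
SELECT id, title, content, author_id, format
FROM   article
WHERE  id = :article_id;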
Is this a sound way to go about this task? Are there any unforeseen or long-term consequences I am overlooking?
If you care about your data, you will not end up with less code in the "denormalized" case - you'll have to enforce that the latest row in Revision always matches the copy in Article. This is far from trivial in a concurrent environment - you'll have to do your locking very carefully!
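To illustrate, under your 2nd option every edit has to do something like this in a single transaction (a sketch in generic SQL; I'm assuming Revision.id is auto-generated and :article_id etc. are bind-variable placeholders):

BEGIN;  -- or START TRANSACTION, depending on the DBMS; Oracle starts one implicitly

-- Serialize concurrent edits of the same article
SELECT content
FROM   article
WHERE  id = :article_id
FOR UPDATE;

-- Record the new version in the history
INSERT INTO revision (article_id, content, author_id, format)
VALUES (:article_id, :new_content, :author_id, :format);

-- Keep the duplicated copy on the article row in sync
UPDATE article
SET    content   = :new_content,
       author_id = :author_id,
       format    = :format
WHERE  id = :article_id;

COMMIT;

Skip the up-front lock and two concurrent edits can interleave so that Article ends up holding one user's content while the newest row in Revision belongs to the other.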
(If you choose for Revision and Article not to contain the same copy, it's even worse - you won't be able to rely on the DBMS to enforce the Revision primary key!)
With a powerful enough DBMS, you could have your cake and eat it too - for example, Oracle materialized views can "pre-JOIN" the data for you without any need to denormalize the actual data model.
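For example, an Oracle-flavored sketch of the idea for your 1st option (a fast-refresh-on-commit join view has extra prerequisites, such as materialized view logs and the base-table ROWIDs in the select list, so check the docs for your version before copying this):

CREATE MATERIALIZED VIEW LOG ON article  WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON revision WITH ROWID;

CREATE MATERIALIZED VIEW article_revisions_mv
  REFRESH FAST ON COMMIT
AS
SELECT a.ROWID AS article_rid,
       r.ROWID AS revision_rid,
       a.id, a.title, a.category_id,
       r.id AS revision_id, r.content, r.author_id, r.format
FROM   article a, revision r
WHERE  r.article_id = a.id;

Readers can then query the materialized view as if it were the denormalized table, while the base tables stay normalized.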
Even if you don't have such a DBMS, consider denormalizing only after you have measured the performance on realistic amounts of data. Yes, JOINs can be expensive, but are they too expensive in your particular situation? Only measurements can tell.
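For example (PostgreSQL shown here purely for illustration - every major DBMS has an equivalent facility), load a realistic number of rows and look at the actual plan and timing of the join you're worried about:

EXPLAIN ANALYZE
SELECT a.id, a.title, r.content
FROM   article a
JOIN   revision r ON r.article_id = a.id
WHERE  a.id = 42;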
BTW, consider using an identifying relationship / natural key, like this:

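(Column types here are just placeholders - adjust to whatever you already use:)

CREATE TABLE revision (
    article_id   INT          NOT NULL REFERENCES article (id),
    revision_no  INT          NOT NULL,   -- 1, 2, 3, ... per article
    content      CLOB         NOT NULL,
    author_id    INT          NOT NULL,
    format       VARCHAR(20),
    PRIMARY KEY (article_id, revision_no)
);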
The revision_no grows monotonically as you add revisions under the given article.
The B-tree structure underneath the Revision PK makes it very efficient to find the latest (or any!) revision of a given article. Unless you have alternate keys not shown in your question, you could also cluster the Revision table and (under Oracle) even compress the leading edge of the clustering index, so the space overhead from repeating article_id is eliminated.
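For example, fetching the latest revision of an article is just a lookup on the leading edge of that PK (a sketch - :article_id is a bind-variable placeholder):

SELECT r.*
FROM   revision r
WHERE  r.article_id  = :article_id
AND    r.revision_no = (
           SELECT MAX(revision_no)
           FROM   revision
           WHERE  article_id = :article_id
       );

Both the MAX subquery and the outer lookup are satisfied by the (article_id, revision_no) index, so the article's history never has to be scanned in full.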