Principles of Modeling: Avoid Redundancy

14 years ago

In 1994, I learned a method for data modeling that is based on three principles. I immediately knew that these principles should be embraced by anyone who does any data modeling or process modeling. Or almost any other job, for that matter. I have described these principles in three previous blog posts: the Jargon Principle, the Concreteness Principle, and the Reproducibility Principle. But I have since found that there are more principles and guidelines that are important to keep in mind when modeling.

Avoid Redundancy

I almost hear you think: “Yes, avoid redundancy. Duh! Do you have any more open doors to kick in?”

But there is actually more to this than meets the eye. I’ll start with how I word this guideline, and then comment on it. Note that there are actually three parts to this guideline.

“No data model shall contain any unchecked redundancy.

“Process models should avoid making the processor do redundant actions.

“The modeler shall strive to avoid redundant work for the domain expert.”

As you see, the three parts apply to three different parts of the work of a modeler: the data model, the process model, and the process to create these models.

Redundancy in the data model

Many sources say that there should never be any redundancy in a data model (in an OLTP workload, that is; data warehouses are a completely different game). As you can see in the first part of my guideline to avoid redundancy, my opinion is a bit more nuanced than that – I say that no data model should contain any unchecked redundancy. And yes, that means that I believe redundancy in a data model can be okay – as long as the developer is aware that it exists, has a good reason to introduce or keep it, and makes sure that this is all thoroughly documented.

With the Concreteness Principle in mind, I will immediately give an example of a situation where I think redundancy is not only acceptable, but even necessary. Just think of the database that a bank uses for your checking account. It will probably include a table that has a single row for each account, including information such as the date it was opened, a reference to the account holder, the currency used for the account, etc. And it will also include a table with a row for each transaction, which includes the transaction date, the amount transferred, an optional description, and of course a reference to the account.

Would you include a column for the current balance in the account table? I bet you would! And yet, this column would be redundant – after all, the current balance can easily be calculated by adding together the amounts of all that account’s rows in the transaction table. So if you believe that there should never be any redundancy in a data model, you should not include this current balance column, but instead compute it whenever it’s needed. Nice in theory – but in practice, that would probably cost too much performance. Which is why every sane data modeler would (and should) include the current balance in the account table. But (s)he must also document this, and any other redundancy that (s)he decides to introduce or keep in the data model – for accountability, for review at a later time, and (maybe most important) to ensure that the QA department will include some thorough tests to thwart the risk that this redundancy leads to inconsistent data.
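To make the checking account example concrete, here is a minimal sketch of what "checked" redundancy could look like. It uses Python with an in-memory SQLite database purely for illustration. The table names, column names, helper functions, and sample values are all assumptions of mine, not a prescribed design. The idea is that the redundant balance is updated in the same database transaction as the insert, and a separate query reports any account where the stored and computed balances disagree.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        account_id      INTEGER PRIMARY KEY,
        opened_on       TEXT    NOT NULL,
        currency        TEXT    NOT NULL,
        -- Documented redundancy: derivable from account_transaction.
        current_balance INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE account_transaction (
        transaction_id INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL REFERENCES account(account_id),
        transacted_on  TEXT    NOT NULL,
        amount         INTEGER NOT NULL,  -- in cents; negative for withdrawals
        description    TEXT
    );
""")


def post_transaction(conn, account_id, transacted_on, amount, description=None):
    """Insert a transaction and update the redundant balance atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO account_transaction "
            "(account_id, transacted_on, amount, description) VALUES (?, ?, ?, ?)",
            (account_id, transacted_on, amount, description),
        )
        conn.execute(
            "UPDATE account SET current_balance = current_balance + ? "
            "WHERE account_id = ?",
            (amount, account_id),
        )


def find_inconsistent_accounts(conn):
    """Return (account_id, stored, computed) where the stored balance is wrong."""
    return conn.execute("""
        SELECT a.account_id, a.current_balance, COALESCE(SUM(t.amount), 0)
        FROM account AS a
        LEFT JOIN account_transaction AS t ON t.account_id = a.account_id
        GROUP BY a.account_id, a.current_balance
        HAVING a.current_balance <> COALESCE(SUM(t.amount), 0)
    """).fetchall()


# Sample data (made up): one account, a deposit and a withdrawal.
conn.execute(
    "INSERT INTO account (account_id, opened_on, currency) "
    "VALUES (1, '2024-01-01', 'EUR')"
)
post_transaction(conn, 1, "2024-01-02", 10000, "deposit")
post_transaction(conn, 1, "2024-01-03", -2500)
```

A check like find_inconsistent_accounts is the kind of test the QA department could run regularly. An empty result means the redundant column still agrees with the transactions.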

Redundancy in the process model

The process model is the blueprint for the program. When the process model contains redundancy, the computer has to do the same work more than once. And that is always wrong, right? Well, no. Not always.

For example, imagine a table with information about items that have to be shipped. Three of the columns in that table would be Net weight, Tare (packaging) weight, and Gross weight. If you store all three of them as normal columns, you avoid making the computer do extra work, but at the cost of redundancy in the data model. If you don’t include the Gross weight column, or define it as a computed column, you remove the data redundancy – but now, the computer has to compute it every time you query the table. That’s redundancy in the process model. And in this case, unlike the previous example (the checking account), the cost to compute the Gross weight when the table is queried has so little effect on performance that I (and I hope all other modelers) would immediately choose to use a computed column for Gross weight, or to not include it at all.
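The shipping example can be sketched in the same way. This is again an illustration under my own assumptions: Python with in-memory SQLite, invented table and column names, and a view standing in for a computed column. Only Net weight and Tare weight are stored, and Gross weight is derived each time it is queried.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipment_item (
        item_id     INTEGER PRIMARY KEY,
        net_weight  REAL NOT NULL,  -- kilograms (assumed unit)
        tare_weight REAL NOT NULL   -- packaging weight, same unit
    );
    -- The view plays the role of a computed column: gross_weight is never
    -- stored, so it cannot become inconsistent with the other two columns.
    CREATE VIEW shipment_item_weights AS
    SELECT item_id,
           net_weight,
           tare_weight,
           net_weight + tare_weight AS gross_weight
    FROM shipment_item;
""")

# Sample data (made up).
conn.execute(
    "INSERT INTO shipment_item (item_id, net_weight, tare_weight) "
    "VALUES (1, 12.5, 1.5)"
)


def gross_weight(conn, item_id):
    """Look up the derived gross weight for one item."""
    row = conn.execute(
        "SELECT gross_weight FROM shipment_item_weights WHERE item_id = ?",
        (item_id,),
    ).fetchone()
    return row[0]
```

The addition is repeated on every query, but it is so cheap that the extra work hardly matters, and an update to either stored weight is reflected immediately.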

This is just one example to illustrate that redundancy in the process model can be okay. But (just as with redundancy in the data model) it should be a conscious and well-documented decision, not an accident. So the modeler “should avoid making the processor do redundant actions” – but not at all costs.

Redundancy for the domain expert

The third part of the guideline to avoid redundancy concerns itself with redundant work for the domain expert, the dialog partner of the modeler. A simpler way to say this is that you shall not ask the same question more than once, nor any different but equivalent question.

But it is no coincidence that this guideline is the last of the three, and that the wording is quite weak (“shall strive”). The correctness of the data and process model is always the most important. Sometimes, the developer already has the answer to a question, but the wording is a bit ambiguous. In such a case, (s)he should always go to the domain expert and make sure to get an unambiguous answer – even if the domain expert may feel this to be redundant.

But don’t go overboard with this! There is no faster way to unemployment than to keep bugging the domain expert with countless superfluous questions. When working without any method, always keep in mind how you expect the domain expert to experience the questions you ask. And when using a method, choose a method that ensures no redundant questions are asked.

Redundancy for the modeler

Maybe you have been wondering why there is no fourth element to the guideline to avoid redundancy, one that applies to the modeler him- or herself. Does this mean that I think that a modeler should just repeat each task two, three, five, or even twenty times?

Sure! Knock yourself out, go crazy!

Well, okay, not really. If you repeat everything twenty times, your work will take so long that you will pretty soon find that you no longer have any work to do.

And yet, I do not include this in the guideline to avoid redundancy. For good reason. Two good reasons, actually. First, it’s not needed – everyone hates repeating the same work over and over again, so no guideline to avoid doing that is ever needed. And why should I include redundancy in a guideline about avoiding redundancy?

But more important – as a result of the Reproducibility Principle, there is actually a good reason to duplicate your work. If you have finished a task and then do it a second time, the result should be the same. If it’s not, you have obviously done something wrong – either the first time you did it, or the second time. If the results are the same, that’s still no guarantee that you made no mistakes, but the chance is a lot lower. So doing a task twice can be a valuable check for errors, rather than redundant work.

Bottom line

The bottom line of this post is pretty simple. Redundancy is pretty bad in almost all situations, but there are always exceptions. Avoid redundancy, but not at all costs.

Hugo Kornelis

Hugo Kornelis is an established SQL Server Community expert. He writes, blogs, speaks, tech edits, and researches, focusing mostly on SQL Server performance and execution plans. He was the technical editor for the third edition of Grant Fritchey's "SQL Server Execution Plans". In 2018 he started a project to document all behaviour of SQL Server execution plans at his website, "the SQL Server Execution Plan Reference" (https://sqlserverfast.com/epr). When not working for the community, Hugo is busy at his day job: freelance database developer/consultant. Hugo has over 20 years of experience on SQL Server in various roles. He has a strong database design background but has since specialized into query tuning and execution plans.