When I started blogging here on sqlblog.com, I intended to write about stuff like T-SQL, performance, and such; but also about data modeling and database design. In reality, the latter has hardly happened so far – but I will try to change that in the future. Starting off with this post, in which I will pose (and attempt to answer) the rather philosophical question of the title: is data modeling an art or a science?
Before I can answer the question, or at least tell you how I think about the subject, we need to get the terms straight – so that if we disagree, we’ll at least disagree over the same things instead of actually agreeing but not noticing. In the next two sections, I’ll try to define what data modeling is and how it relates to database design, and what sets art apart from science in our jobs.
Data modeling and database design
When you have to create a database, you will ideally go through two stages before you start to type your first CREATE TABLE statement. The first stage is data modeling. In this stage, the information that has to be stored is inventoried and the structure of that information is determined. This results in a logical data model. The logical data model should focus on correctness and completeness; it should be completely implementation agnostic.
The second stage is a transformation of the logical data model into a database design. This stage usually starts with a mechanical conversion from the logical data model to a first draft of the implementation model (for instance, if the data model is represented as an ERM diagram but the data has to be stored in an RDBMS, all entities and all many-to-many relationships become tables, and all attributes and one-to-many relationships become columns). After that, optimization starts. Some of the optimizations will not affect the layout of tables and columns (examples are choosing indexes, or implementing partitioning), but other optimizations will do just that (such as adding a surrogate key, denormalizing tables, or building indexed views to pre-aggregate some data). The transformation from logical data model to database design should focus on performance for an implementation on a specific platform. As long as care is taken that none of the changes made during this phase affect the actual meaning of the model, the resulting database design will be just as correct and complete (or incorrect and incomplete) as the logical data model it is based on.
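As a minimal sketch of that difference (all names are hypothetical, and the syntax is SQL Server flavored): the mechanical first draft keeps the natural key from the logical model, while the optimized design adds a surrogate key for performance without changing what the table means.

```sql
-- Mechanical first draft from the logical model: the natural key is the primary key.
CREATE TABLE CustomerDraft
(
    CustomerCode varchar(10)  NOT NULL,
    CustomerName varchar(100) NOT NULL,
    CONSTRAINT PK_CustomerDraft PRIMARY KEY (CustomerCode)
);

-- A possible optimization during database design: a surrogate key is added for
-- performance, but the natural key stays enforced, so the meaning of the model
-- does not change.
CREATE TABLE Customer
(
    CustomerID   int IDENTITY(1,1) NOT NULL,
    CustomerCode varchar(10)       NOT NULL,
    CustomerName varchar(100)      NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerID),
    CONSTRAINT UQ_Customer_CustomerCode UNIQUE (CustomerCode)
);
```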
Many people prefer to take a shortcut by combining these two stages. They produce a data model that is already geared towards implementation in a specific database – for instance, by adding surrogate keys right into the first draft of the data model, because they already know (or think they know) that these keys will eventually be added anyway. I consider this bad practice for the following reasons:
- It increases the chance of errors. When you have to focus on correctness and performance at the same time, you can more easily lose track of one or the other.
- It blurs the line. If a part of a data model can never be implemented at acceptable performance, an explicit decision has to be made (and hopefully documented) to either accept crappy performance or change the data model. If you model for performance, chances are you’ll choose a better performing alternative straight away and never properly document why you deviated from the original requirements.
- It might have a negative impact on future performance. The next release of your DBMS, or maybe even the next service pack, may turn today’s performance winner into tomorrow’s performance drain. By separating the logical data model from the actual database design, you make it very easy to periodically review the choices you made for performance and assess whether they are still valid.
- It reduces portability and maintainability. If, one day, your boss informs you that you need to port an existing application to another RDBMS, you’ll bless the day you decided to separate the logical data model from the physical database design, because you only need to pull out the (still completely correct) logical data model and transform it again, this time applying the optimization tricks appropriate for the new RDBMS. And as requirements change, it is (in my experience) easier to identify the required changes in an implementation-independent logical data model and then carry them over to the physical design than to do it all at once when only the physical design is available.
- It may lead to improper choices. More often than I’d like, I have seen good modelers fall victim to bad habits, such as adding a surrogate key to every table (or entity, or whatever) in the data model. But just because surrogate keys are often good for performance (on SQL Server, that is – I wouldn’t know about other DBMSs) doesn’t mean they should always be used. And the next step (which I’ve witnessed too!) is forgetting to identify the real key because there already is a key (albeit a surrogate one) – the sketch right after this list shows how that goes wrong.
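A hedged sketch of that last pitfall, with hypothetical tables: once the surrogate key is treated as "the" key, nothing in the first table stops duplicate countries, while the second table still enforces the real key.

```sql
-- The bad habit: the surrogate key is the only key, so nothing stops duplicates.
CREATE TABLE Country_Bad
(
    CountryID   int IDENTITY(1,1) NOT NULL,
    CountryCode char(2)           NOT NULL,
    CountryName varchar(100)      NOT NULL,
    CONSTRAINT PK_Country_Bad PRIMARY KEY (CountryID)
);

INSERT INTO Country_Bad (CountryCode, CountryName)
VALUES ('NL', 'Netherlands'),
       ('NL', 'Netherlands');   -- both rows are accepted

-- With the real key identified and enforced, the duplicate would be rejected.
CREATE TABLE Country
(
    CountryID   int IDENTITY(1,1) NOT NULL,
    CountryCode char(2)           NOT NULL,
    CountryName varchar(100)      NOT NULL,
    CONSTRAINT PK_Country PRIMARY KEY (CountryID),
    CONSTRAINT UQ_Country_CountryCode UNIQUE (CountryCode)
);
```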
Art and science
For the sake of this discussion, “art” (work created by an artist) is the result of some creative process, usually completely new and unique in some way. Most artists apply learned skills, though not always in the regular way. Artists usually need some kind of inspiration. There is no way to say whether a work of art is “good” or “bad”, as that is often in the eye of the beholder – and even if all beholders agree that the work sucks, you still can’t pinpoint what exactly the artist has done wrong. Examples of artists include painters, composers, architects, etc. But some people not usually associated with art also fit the above definition, such as politicians, blog authors, or scientists (when breaking new ground, such as Codd did when he invented the relational model). Or a chef in a fancy restaurant who is experimenting with ingredients and cooking processes to find great new recipes to include on the menu.
“Science” does NOT refer to the work of scientists, but to work carried out by professionals, for which not creativity but predictability is the first criterion. Faced with the same task, a professional should consistently arrive at correct results. That doesn’t imply that he or she always will get correct results, but if he or she doesn’t, then you can be sure that an error was made and that careful examination of the steps taken will reveal exactly what that error was. Examples of professionals include bakers, masons, pilots, etc. All people that you trust to deliver work of a consistent quality – you want your bread to taste the same each day, you want to be sure your home won’t collapse, and you expect to arrive safely whenever you board a plane. And a regular restaurant cook is also supposed to cook the new meal the chef put on the menu exactly as the chef instructed.
Data modeling: art or science?
Now that all the terms have been defined, it’s time to take a look at the question I started this blog post with – is data modeling an art or a science? And should it be?
To start with the latter, I think it should be a science. If a customer pays a hefty sum of money to a professional data modeler to deliver a data model, then, assuming the customer did everything one can reasonably expect¹ to answer questions from the data modeler, the customer can expect the data model to be correct².
¹ I consider it reasonable to expect that the customer ensures that all relevant questions asked by the data modeler are answered, provided that the data modeler asks these questions in a language and a jargon familiar to the customer and/or the employees he interviews. I do not consider it reasonable to expect the customer (or his employees) to learn the language, jargon, or diagramming styles used by data modelers.
² A data model is correct if it allows any data collection that is allowed according to the business rules, and disallows any data collection that would violate any business rule.
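To make that definition a bit more concrete, here is a hedged sketch (hypothetical tables) of the business rule "every order belongs to an existing customer": the first version of the model is incorrect by this definition because it still allows data that violates the rule, and the foreign key is what makes it correct.

```sql
-- Business rule: every order belongs to an existing customer.
CREATE TABLE Customers
(
    CustomerID int NOT NULL CONSTRAINT PK_Customers PRIMARY KEY
);

-- Incorrect by the definition above: an order can reference a customer
-- that does not exist, so a forbidden data collection is still allowed.
CREATE TABLE Orders
(
    OrderID    int NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
    CustomerID int NOT NULL   -- no foreign key, so the rule is not enforced
);

-- Adding the constraint makes the model disallow exactly those violations.
ALTER TABLE Orders
ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (CustomerID) REFERENCES Customers (CustomerID);
```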
Unfortunately, my ideas of what data modeling should be like appear to be out of step with the current state of reality. One of the key factors I named for “science” is predictability. And to have a predictable outcome, a process must have a solid specification. As in: “if situation X arises, you need to ask the customer question Y; in case of answer Z, add this and remove that in the data model”. Unfortunately, such exactly specified process steps are absent in most (and in all commonly used!) data modeling methods. Barring such rules, you have to rely on the inspiration of the data modeler – will he or she realize that question Y has to be asked? And if answer Z is given, will the modeler realize that this has to be added and that has to be removed? And if that doesn’t happen, then who’s to blame? The modeler, who did everything the (incomplete) rules that do exist prescribe, but lacked the inspiration to see what was required here? Or should we blame ourselves, our industry, for allowing data modeling to remain an art for several decades already, and for still accepting this as “the way it is”?
Many moons ago, when I was a youngster who had just landed a job as a PL/I programmer, I was sent to a course on Jackson Structured Programming. This is a method for building programs that process one or more inputs and produce one or more outputs. Though it can be used for interactive programs as well, its main strength is in batch programs accessing sequential files. The course was great – though the students would not always arrive at the exact same design, each design would definitely be either correct or incorrect. All correct designs would yield the same end result when executed against the same data. And for all incorrect designs, the teacher was able to pinpoint where in the process an error was made. For me, this course changed programming from art into science.
A few years later, I was sent to a data modeling course. Most of the course focused on how to represent data models (some variation of ERM was used). We were taught how to represent a data model, but not how to find it. At the end, we were given a case study and asked to make a data model, which we would then present to the class. When we were done and the first model was presented, I noticed some severe differences from my model – differences that would result in different combinations of data being allowed or rejected. So when the teacher made some minor adjustments and then said that this was a good model, I expected to get a low mark for my work. Then the second student had a model that differed from both the first and mine – and again, the teacher mostly agreed with the choices made. This caused me to regain some of my confidence – and indeed, when it was my turn to present my, once again very different, model, I too was told that it was a very good one. So we were left with three models, all very different, and according to the instructor they were all “correct” – and yet, data that would be allowed by the one would be rejected by the other. This course taught me that data modeling was not science, but pure art.
This was over a decade ago. But has anything changed in the meantime? Well, maybe it has – but if so, it must have been while I was not paying attention, for I still do not see any mainstream data modeling method that provides the modeler with a clear set of instructions on what to do in every case, or the auditor with a similar set of instructions to check whether the modeler did a great job or screwed up.
Who’s to blame?
Another difference between art and science is the assignment of blame. If you buy a loaf of bread, have a house built, or board a plane, then you know whom to blame if the bread is sour, if the house collapses, or if the plane lands at the wrong airport. But if you ask a painter to create a painting for your living room and you don’t like the result, you cannot tell him that he screwed up – because beauty is truly a matter of taste.
Have you ever been to a shop where any color of paint can be mixed by combining the right amounts of a few base colors? Suppose you go to such a shop with a small sample of dyed wood, asking them to mix you the exact same color. The shopkeeper rummages through some catalogs, compares some samples, and then scribbles on a note: “2.98 liters white (#129683), 0.15 liters of cyan (#867324), and 0.05 liters of red (#533010)”. He then tells you that you have to sign the note before he can proceed to mix the paint. So you sign. And then, once the paint has been mixed, you see that it’s the wrong color – and the shopkeeper waves the signed slip of paper, telling you it’s “exactly according to the specification you signed off on”, so you can’t get your money back. Silly, right?
And yet, in almost every software project, there will be a moment when a data model is presented to the customer, usually in the form of an ERM diagram or a UML class diagram, and the customer is required to sign off for the project to continue. This is, with all respect, the same utter madness as the paint example. Let’s not forget that the customer is probably a specialist in his trade, be it banking, insurance, or selling ice cream, but not in reading ERM or UML diagrams. How is he supposed to check whether the diagram is an accurate representation of his business needs and business rules?
The reason data modelers require the customer to sign off on the data model is that they know data modeling is not science but art. They know that the methods they use can’t guarantee correct results, even on correct inputs. So they require a signature on the data model, so that later, when nasty stuff starts hitting the fan, they can wave the signature in the customer’s face, telling him that he himself signed off on the implemented model.
In the paint shop, I’m sure that nobody would agree to sign the slip with the paint serial numbers. I wouldn’t! I would agree, though, to place my signature on the sample I brought in, as that is a specification I can understand. Translating that to paint numbers and quantities is supposed to be the shopkeeper’s job, so let him take responsibility for that.
So, I guess the real question is … why do customers still accept it when they are forced to sign for a model they are unable to understand? Why don’t they say that they will gladly sign for all the requirements they gave, and for all the answers they provided to questions that were asked in a language they understand, but that they insist on the data modeler taking responsibility for his part of the job?
Maybe the true art of data modeling is that data modelers are still able to get away with it…
10 Comments
I’m not sure that it is possible to make data modelling entirely scientific but agree that steps could and should be taken to introduce more rigour into it.
And, yes, users should only ever sign off on the requirements and the answers they give to the data modellers.
Hugo,
Great coverage of this topic!
One item you didn’t mention is that subsequent lifecycle changes to the database design need to be applied, as appropriate, to both the logical and physical data models. If not, the logical data model drifts out of sync with the implemented physical data model, and a physical data model later derived from it for a different DBMS will not match the current implementation. I have often seen the best of intentions in this area – starting with a logical data model during initial application design and deriving an initial physical data model – only to have all later application lifecycle database design changes implemented only in the physical data model.
As you mentioned, not all aspects of the physical data model need to be reflected in the logical data model (such as indexes and partitioning, which are physical-only issues), but other database design changes need to be reflected in both data models.
Keeping separate models in sync has historically been a challenging and effort- and time-consuming application lifecycle maintenance goal. Some tools can help with this effort. Hopefully, tools will get better at this over time. But it still comes down to the discipline of the data / database architects to value both models and make it happen.
Scott R.
Scott R. wrote: "One item you didn’t mention is that subsequent lifecycle changes to the database design need to be applied as appropriate to both the logical and physical data models"
That’s a VERY good point – and one where many tool vendors have completely dropped the ball.
I’m sure that everyone who has ever dabbled in data modeling will have used tools that are able to generate the DDL for a database from the finished model. This is great – for a first release. But in practice, the first release is never the last, and what good is updating the model in the tool if all the tool can generate is a bunch of CREATE TABLE statements? The tables already exist from the first release, and are filled with millions of rows of customer data right now; the LAST thing one wants to do is to drop and recreate them!
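As a hedged illustration (a hypothetical table, not the output of any particular tool), this is the difference between what a model-to-DDL generator typically produces for a later release and what a deployed database actually needs:

```sql
-- Release 1: the table as generated from the model, by now full of production data.
CREATE TABLE dbo.Customer
(
    CustomerID   int          NOT NULL CONSTRAINT PK_Customer PRIMARY KEY,
    CustomerName varchar(100) NOT NULL
);

-- Release 2: the model gained an e-mail address. Regenerating the CREATE TABLE
-- would mean dropping the table and its data, so the change has to be shipped
-- as an incremental ALTER instead.
ALTER TABLE dbo.Customer
    ADD EmailAddress varchar(254) NULL;

-- Only after existing rows have been backfilled can the column be made mandatory:
-- ALTER TABLE dbo.Customer ALTER COLUMN EmailAddress varchar(254) NOT NULL;
```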
This is one of the main reasons why so many tools are bought, installed, used for a first release … and then end up being abandoned and forgotten. Keeping the model current becomes an extra chore if the changes can’t be automatically propagated to the deployed database; it ends up like all documentation: "We’ll update it later; right now we have a deadline to meet, but after that, we will reserve time to change the documentation. Honest!"
nice article
makes me think about discussions a long time ago: is programming art or science
style matters, elegance matters, you like it or you don’t, so it’s an art
but with one difficulty: in the end it has to work, like science does
the difference might be that inspiration is less important; the idea about what is needed does not come from the programmer, and, as far as I am concerned, that’s what makes art art: the idea
Hugo;
Some interesting points you raise.
Your experience with the class where three different models were presented that all appeared to meet the requirements is exactly in line with the research that Simsion has done on data modeling (see "Data Modeling Essentials" by Simsion and Witt).
In short, given a particular conceptual model, there are one or more logical models that will satisfy the requirements specified therein. The choices that go into making that logical design involve myriad and often conflicting concerns. This is where the "art" happens. Optimistically, though, if we were able to encode these design concerns and parameterize them, it would remove much of the subjectivity in the conceptual-to-logical mapping.
Your paint analogy is nice in that it points out exactly where we sometimes go wrong as data professionals. The subject matter experts should not be asked to review logical models, but rather the conceptual model that informs them. The conceptual model should completely represent the business rules at play in the domain of interest, and there can really be only one "correct" conceptual model (for any differences in candidate conceptual models directly map to differences in understanding of the problem domain, and must be resolved before automation).
TroyK
Hugo,
You say, about the logical data model: "The logical data model should focus on correctness and completeness; it should be completely implementation agnostic."
First: What does it mean?
(I’ve looked up in Wikipedia what "logical correctness" means, but somehow I’m not able to apply it to logical data modeling.)
Second: How can I achieve this?
Maybe you know some books that can explain this in simple terms.
Hi Arjen;
I’m sure Hugo can expand on this a bit, but the short answer to your question is that a logical model should be free of any constructs that are specific to the RDBMS of your choice. In other words, a logical model need not be changed whether you implement it on SQL Server or Oracle (or, perhaps more realistically, SQL Server 2000 or SQL Server 2008).
When you move to the physical implementation, you consider things like DBMS-specific datatypes (HierarchyID, the new Date and Time datatypes), performance-based tradeoffs, etc.
It’s important to remember that for any single conceptual model (which must be "correct", at least with respect to the current understanding of the problem domain), there are multiple possible "correct" logical models, even if we restrict them to relational.
TroyK
Hi TroyK,
I understand that the logical data model differs from the physical data model. It’s on a higher abstraction level and leaves out any information about physical storage.
But correctness and completeness are terms which are also used in the scientific field of ‘Logic’. They have a certain meaning there, and I was wondering whether they carry the same meaning here.
We try to focus on all the entities in the domain and check if it is all there; if so, we consider it complete. We also check if it looks satisfiable, and if we are okay with it we consider it correct. If this is generally the meaning of completeness and correctness, no further questions asked, but maybe Hugo is referring to something more subtle. Something you should always do to improve the quality of the model.
It feels a bit like saying "you should always do normalization" to someone who has never heard of "normalization" in the context of database design. (Yes, shame on me if that were true.)
It’s all about the phrase "you should focus on …". It says "here comes something really important" and then it’s followed by these two terms, "correctness" and "completeness". I have no idea what he means by these terms, or whether they come from some theoretical area. If you think it’s important to mention them, it would be nice to give some explanation. A simple link will do, just like the link under ERM.
Personally, I think if you go a bit academic you narrow your group of readers to those who already know what it’s about. If that’s the purpose, it’s okay, but if the purpose is to make designers more aware that there is more out there to improve the quality of designs in general, it won’t hurt to add a bit more explanation.
Interesting article, though the details don’t do the subject justice. In many instances, the thing we call science has a bit of art mixed in. Changing the order of adding elements during a chemistry process can alter the outcome. Virtually every "science" experiment will have varying results when done by different individuals. The real trick is in understanding the variables, standardizing the variables, and explaining or removing the variables.
Likewise, data modeling is a "discipline" requiring the same attention to detail. In your example, the shop owner should ensure (and usually does) that the shop sample is a known entity. The components that result in a particular color have been measured over and over. The most important part of this process is getting the customer to confirm that the store sample and his sample are an exact match.
Early in my career, I discovered a universal truth: a person will always tell you what they want, but seldom tell you what they need. If you give ’em what they want, they will always be asking for more. If you give ’em what they need, they will be happy. Therefore, it is the data modeler’s responsibility to communicate with the business expert to separate the wants from the needs. During the all-important model review, it is up to the data modeler to present the model in a manner that doesn’t overwhelm the business representative. The data modeler does not need to impress the business with the complex representation of all his/her hard work. That would be counter-productive, like showing the shop customer the component index numbers: unnecessary static.
I have interviewed many candidates for data modeling positions. An alarming number have little to no clue of what it takes to successfully design, model and present a database.
Is data modeling an art or a science? Currently, it is a discipline. Someday, I believe it will evolve into a science, but for now it is a bit of both. Remember, a little more than a hundred years ago engineering was not a science; it was a lot of trial and error – kind of like data modeling.
For me, data models are more of a science kind of thing, especially logical data models. But maybe that is because I work as an enterprise architect myself. Interesting article, I have to say!
Regards, Felix