NULL – The database’s black hole

NULL – The database’s black hole

I once read a scientific article on black holes that started of on the observation that, since a hole is the absence of anything and black is invisible in space, a black hole is in fact an invisible nothing – so what the heck are we talking about? Well, almost the same can be said about NULL in databases.

 

Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard:

 

            null value: A special value that is used to indicate the absence of any data value.

 

This definition differs significantly from some common misconceptions about NULL:

 

·        NULL is not the same as the numeric value 0, even though they are pronounced the same in some languages (like Dutch or German).

·        NULL is also not the same as the string value ‘’ (also known as the empty string), even though many Oracle developers would like to believe so. However, they can’t be blamed for this as this is completely Larry Ellison’s fault.

·        NULL is definitely not the same as either of the date values January 1st 1753, January 1st 1900, or December 31st 9999, even though there might be valid (performance related) reasons to use either of those values as a magic value instead of NULL in a specific situation.

·        Moving to the more controversial stuff, NULL does not mean “not applicable”. Of course, a NULL in a table is often a result of the attribute not being applicable for a specific occurrence of the entity stored in the table (e.g., a column “birthday” in a customer table that stores details of businesses as well as humans) – but in other columns and other tables, or even in another row of the same table, there might be a completely different reason for the data being missing (absent)!

·        And saving the best (read: most controversial) for the last, NULL is also definitely not meant to signify “unknown”. Again, a NULL in a table might result from the value being unknown at data entry time (e.g., when we forget to ask a customer for his or her birthday), but there might be other reasons as well. Unfortunately, many text books insist on explaining the behaviour of NULL in expressions by describing NULL as unknown, rather than missing or absent, causing this misconception about NULL to be the most widespread and the hardest to combat.

 

Allow me to dwell some more on this whole unknown/not applicable thing, as it’s able to confuse even highly-appraised scientists – as is aptly demonstrated in “Much Ado About Nothing”, an exchange between Dr. E.F. Codd and C.J. Date about NULL, in which Date gets Codd to agree that there should be more than one kind of NULL. The fact that both Codd and Date apparently missed, is that the examples used in their debate failed to meet the basic rules of normalization! Moving back to the birthday example to illustrate this, it is true that NULL in this column can have different causes – but that doesn’t change the meaning of this NULL, which is limited to “the birthday for this particular customer is not in the database”. If the business doesn’t care why a birthday is missing, then the fact that there might be different causes doesn’t have any consequences on the database. If, on the other hand, the business does case about the reason that a birthday is missing, then this reason should of course be modeled and stored in the database – but not in the same column as the birthday! “Birthday” is one attribute and “Reason birthday is missing” is a different attribute – heck, they even have completely disjunctive domains! The fact that these attributes are mutually exclusive doesn’t warrant violating first normal form by stuffing these two attributes in a single column! In this case, a proper design for a SQL Server table would look like this:

 

CREATE TABLE Customers

       (CustomerID int NOT NULL,

        Birthday datetime NULL,

        ReasonNoBirthday int NULL,

     — Other columns,

        CONSTRAINT PK_Customers PRIMARY KEY (CustomerID),

        —  Time part for birthday has to be midnight

        CONSTRAINT CK_Birthday CHECK

                     (CONVERT(char(8),Birthday,108)=’00:00:00′),

        — Birthday mutually exclusive with ReasonNoBirthday

        CONSTRAINT CK_ReasonNoBirthday CHECK

                     ((Birthday IS NULL AND ReasonNoBirthday IS NOT NULL)

                   OR (Birthday IS NOT NULL AND ReasonNoBirthday IS NULL))

       );

 

Things get more complex when nullable columns are used in expressions and predicates. In a procedural language, this wouldn’t have been a problem – if a procedural program fails to find the information it needs, it enters a conditional branch to handle this situation, as defined by the programmer. In a declarative, set-based language such as SQL,  this was not possible. The alternatives were either to have the SQL developer add conditional expressions for each nullable column in a query to handle missing data, or to define a decent default behavior in SQL for missing data so that developers only have to write explicit conditional expressions if they need to override the default behavior.

 

The default NULL handling in expressions is very much based on how humans would handle similar situations. For instance, stop and think a moment what you would reply if I asked you to calculate my age, and to mimic database behavior, I’d also constrain your answer to be within the domain of integer values. You’d probably first ask me for my birthday. But if I refuse to specify my birthday, you would be unable to answer the question – so you wouldn’t answer it at all. And this is exactly what a database does – if any value to be used in an expression is missing (NULL) , there won’t be any result of the expression; in other words, the result is missing (NULL) as well. This is known as the rule of NULL propagation: any expression in SQL returns NULL if any of its input arguments is NULL (with the notable exception of a few functions that are specifically conceived for NULL handling). And that is a second similarity between NULL and a black hole: the gravity of a black hole pulls in everything that gets too close, causing it to “disappear” in the hole, and the rule of NULL propagation causes any expression that has a NULL in it to have no result, as if the other input values disappear as well.

 

Just to confuse matters more, default handling of NULL in a predicate is different. There is a valid reason for this, though. First of all, expressions are supposed to return a value that adheres to the rule of a datatype, but “Hey dude, I can’t answer that” is obviously not a valid value in any numeric, date, time, or datetime domain. It would of course be valid in a domain for character strings, but you can’t just overload what might already be a valid “regular” value with a special meaning. NULL however is valid, in any domain. So there really was no other choice but to return NULL if input data for an expression is missing. The reason that predicates can’t be handled the same way, is that a predicate is typically used in a WHERE expression, so there has to be a result for each row – even if some input data is missing, the DBMS still has to decide whether or not to include the row in the output. This problem was also solved by mimicking what humans would do. Suppose I’d ask you to tell me whether I am older than thirty-five. This time, I won’t restrict your answer to any domain, but I do force you to give an answer. And of course, I still refuse to disclose my date of birth. So, since you obviously can’t say “yes” or “no” without having a 50% chance of being wrong, what will your answer be? Probably some more or less polite variation of “bugger off, dude, how am I supposed to know that if you don’t disclose your birthday?” Translated back to database terms, you answer would not be true or false, but unknown. Not unknown because some input data is unknown (as some text books write), but unknown because some input data is missing.

 

This choice solves the problem, since we are now able to evaluate each predicate in a WHERE clause for each row, with the result being either true (row is included), false (row is omitted), or unknown (in which case the row is omitted as well). In a future post, I will cover how this affects logical expressions that involve AND, OR, or NOT to modify predicates, and how unknown is sometimes treated the same as false, yet other times treated the same as true, depending on the context of the predicate.

The Bounding Box, corrected version
The logic of three-valued logic

Related Posts

No results found.

21 Comments. Leave new

  • James Luetkehoelter
    July 6, 2007 16:33

    I find this a very interesting discussion of NULL. Well done. I especially like pointing out the difference between "Birthday" and "Why Birthday is missing" as being separate attributes. That being said, I’m not sure I agree. I’m still chewing on this, but I love angle you’ve taken (catchy title too 🙂 ).

    Reply
  • Denis Gobo
    July 6, 2007 16:45

    Talking about Birthdays, here is something interesting: Birthday paradox

    In probability theory, the birthday paradox states that in a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have the same birthday. For 57 or more people, the probability is more than 99%

    http://en.wikipedia.org/wiki/Birthday_paradox

    Reply
  • Hugo Kornelis
    July 6, 2007 17:15

    Thanks for your comments, all.

    James – if your chewing ends in you still disagreeing with the seperation of the two attributes, then I’d love to hear (read) your arguments for that position!

    Reply
  • James Luetkehoelter
    July 7, 2007 18:00

    Actually the chewing is more overall – I would agree with the separation of attributes. After chewing on it (resulting in very little sleep and dreams about playing poker with Codd, Date and Larry Ellison – we ended up throwing Ellison out of the game), I believe I have my finger on what gave me pause. It results in me babbling significantly to explain, so I’m going to respond fully in separate post.

    I still really, really your discussion of things. You took theory and moved it to real world very nicely.

    Reply
  • Hugo Kornelis
    July 8, 2007 19:41

    Hi James,

    Next time you find yourself in a poker game with Codd, Date, and Ellison, throw them all and keep their chips for yourself.

    I’m really looking forward to your post. Maybe I ought to say that I’m sorry for making you lose sleep, but fact is, I am not – it’s always a good thing to get people to think!

    Reply
  • James Luetkehoelter
    July 9, 2007 18:28

    Agreed! I think there’s nothing better than discussing something controversial. Wonderful, I have a fellow argue-r to talk to! Post will be up this afternoon(for me).

    Reply
  • David Portas
    July 9, 2007 19:31

    Hugo, your Customers table certainly isn’t in 2NF.

    A functional dependecy x -> y over r must satisfy that x = y forall tuples in r  – not the case if the table contains nulls. (For example see "Theory of R. DB." (Maier) p42 for the formal definitions and some important consequences).

    Normalization of optional attributes can be done by decomposition. Fairly obviously, the Birthday and ReasonNoBirthday attributes belong in separate tables since they refer to different entities – "Customer with birthday" and "Customer without birthday".

    Reply
  • David Portas
    July 9, 2007 19:38

    Oops! Obviously the second sentence I wrote was rubbish! The requirement is that an FD satisfies the predicate T1{x} = T1{y} implies T2{x} = T2{y} for every pair of tuples T1 and T2. Therefore, no such FD is satisfied if either x or y are null.

    Reply
  • David Portas
    July 9, 2007 19:55

    Grrr! I’ll try one more time: T1{x} = T2{x} implies T1{y} = T2{y}. But don’t take my word for it 🙂

    Reply
  • Hugo Kornelis
    July 9, 2007 20:29

    Hi David,

    Thanks for your comments. However, I don’t think I can agree. I had to do some googling in order to find formal definitions for 2NF, since I am used to working with a modeling method that skips normalisation and takes the model straight into 5NF. I found some contradicting ways to formulate the requirements for 2NF, but no real disagreements.

    I’ll go with the definition on Wikipedia: "a 1NF table is in 2NF if and only if, given any candidate key and any attribute that is not a constituent of a candidate key, the non-key attribute depends upon the whole of the candidate key rather than just a part of it".

    I did not google for a definition of functional dependency, since I know that Y is functionally dependent on X if for any given value of X, there is at most one value of Y.

    By this definition of functional dependency, both Birthday and ReasonNoBirthday are definitely functional dependent on CustomerID, since for each CustomerID, there will be either exacly one Birthday or none at all (NULL), and exactly one ReasonNoBirthday or none at all (NULL).

    The Customers table has one candidate key: CustomerID. It has two non-key columns: Birthday and ReasonNoBirthday. Both are functionally dependent on the candidate key, ergo the table is in second normal form.

    Unless I am really being led astray as to the 2NF definition or as to the definition of functional dependency, but I’m sure you’ll tell me if I am!

    Reply
  • Alex Kuznetsov
    July 10, 2007 17:04

    Hey Hugo,

    I think you are making excellent points, I loved your post! Just wanted to note that I think that Birthday is functionally dependent on both CustomerID and ReasonNoBirthday . If ReasonNoBirthday is not null, then Birthday must be NULL. In some cases ReasonNoBirthday is also functionally dependent on both CustomerID and Birthday – it is when you must specify ReasonNoBirthday when you omit Birthday. However, I think in some cases you can omit both Birthday and ReasonNoBirthday – it depends on your business rules. What do you think?

    Reply
  • Hugo Kornelis
    July 10, 2007 23:57

    Hi Alex,

    Birthday would be functionally dependent on both CustomerID and ReasonNoBirthday if you have to specify both before you can tell for sure what (if any) the birthday is. That is clearly not the case – given a CustomerID, I can tell you without a trace of a doubt (assuming I trust the contents of my DB) whether a birthday is stored for this customer and what it is.

    An example of functional dependency on more than one attribute would be the AmountOrdered being dependant on the combination of OrderID and OrderLineNo. If you only know the OrderID, you can’t tell me the AmountOrdered (unless you’re lucky and it’s a one-line order), and neither can you tell me the AmountOrdered if I supply OrderLineNo but not OrderID. You really need both before you can tell me the AmountOrdered.

    Sure, there is a relationship between Birthday and ReasonNoBirthday (as captured in the CHECK constraint), but it’s not a relationship of functional dependancy.

    Reply
  • The quote "any expression in SQL returns NULL if any of its input arguments is NULL (with the notable exception of a few functions that are specifically conceived for NULL handling)." appears a bit misleading? AFAIK sum, min, max, and other statistical functions were Not "specifically conceived for NULL handling": Consider SELECT

    SUM(NUMERIC_SCALE) AS SumIsNullInsensitive

    ,STDEV(NUMERIC_SCALE) AS STDEVIsNullInsensitive

    ,MIN(NUMERIC_SCALE) AS MinIsNullInsensitive

    ,Max(NUMERIC_SCALE) AS MaxIsNullInsensitive

    FROM INFORMATION_SCHEMA.columns

    HTH

    Reply
  • Database in Depth
    July 17, 2007 21:47

    Suggested reading – http://www.oreilly.com/catalog/databaseid/

    Reply
  • Hugo Kornelis
    July 19, 2007 21:25

    Hi "sql",

    You make a fair point. I admit that I forgot to include aggregate functions in my post.

    Since aggregate functions are quite simple, I won’t write a seperate post on them. Their treatment of NULLs is very simple – they just dsiregard them. So if you have a table with three rows, with values 10, 90, and NULL in an integer column, then the results of MIN(column), MAX(column), SUM(column), AVG(column), and COUNT(column) will be 10, 90, 100, 50, and 2 – just as if the row with the NULL value doesn’t exist at all.

    If the elimitanion of NULL values results in no rows being left, COUNT(..) will return 0 and all other aggregate functions will return NULL (just as when they are run on an empty table).

    Reply
  • Hugo Kornelis
    October 1, 2007 09:17

    The fourth (and final) part of this series about  NULL is now available at http://sqlblog.com/blogs/hugo_kornelis/archive/2007/09/30/what-if-null-if-null-is-null-null-null-is-null.aspx.

    Reply
  • Phillip Hamlyn
    January 4, 2008 19:19

    Returning to the Dates of Birth – I’ve taken to seperating Day, Month and Year into seperate fields in SQL Server instead of treating Date Of Birth as a Date. Why ? My analysis has shown that in most systems containing public or customer data that a higher than expected proportion of people are being recorded as being born on the 1st of any specific month (5-10% more than statistically probable, depending on the type of system). Unless its absolutely required otherwise I’ve got into the habit of making this three nullable fields – this means that when someone simply gives their age, (or possibly guesses at their own date of birth – common for those that have no cerain birth record) we can record without forced inaccuracy, exactly the information given. A simple calculated field will give an implied date of birth if neccessary, but it prevents the operator being forced to enter inaccurate data just because the data type demands it.

    Reply
  • Gints Plivna
    March 26, 2008 15:37

    I’ve taken to seperating Day, Month and Year into seperate fields in SQL Server

    instead of treating Date Of Birth as a Date.

    I usually use in such cases one birth date and 2 flags:

    UnknownDate (T, F)

    UnknownMonth (T, F)

    and of course UnknownMonth=T implies UnknownDate=T

    Reply
  • plz tell me its true or false:

    true expression or null expression will result in NULL

    Reply
  • Did I just read an episode of Big Bang Theory? Geez you guys are smart.  I just wanted to know the difference between NULL and an empty field.  

    Reply
  • i understand it a litel bit, but i think tht i will understand it if you’ll give me some examples

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close